
Chapter 21. You Can't Quite Call This Editing

21.1 And Why Not?

Summary Box

There are many specialized forms of editing that come up often enough that it's worth saving them in a script. Examples of this kind of thing include:

  • fmt (Section 21.2) and related scripts (Section 21.3) for reformatting jagged lines into neat paragraphs

  • recomment (Section 21.4), a script for reformatting comment blocks within programs and scripts

  • behead (Section 21.5), a script for removing the headers from mail and news messages

  • center (Section 21.8), a script for centering lines of text in a file

In addition, there are a number of programs that provide some useful ways of modifying files but that you don't normally think of as editors:

  • split (Section 21.9) and csplit (Section 21.10) let you split a big file into smaller pieces.

  • tr (Section 21.11) lets you substitute one character for another — including non-printing characters that you specify by their octal values.

  • dd (Section 21.6, Section 21.13) lets you perform various data conversions on a file.

  • cut (Section 21.14) lets you cut columns or fields out of a file, and paste (Section 21.18) lets you put them back, perhaps in a different order.

This chapter covers all that and more.

— TOR

21.2 Neatening Text with fmt

One of the problems with fold is that it breaks text at an arbitrary column position — even if that position happens to be in the middle of a word. It's a pretty primitive utility, designed to keep long lines from printing off the edge of a line printer page, and not much more.

fmt can do a better job because it thinks in terms of language constructs like paragraphs. fmt wraps lines continuously, rather than just folding the long ones. It assumes that paragraphs end at blank lines.
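A quick way to see the difference is to run both tools on the same line. This is only a sketch: it assumes a POSIX shell and an fmt that accepts -w (GNU fmt does), and the sample sentence and the width of 18 are made up for illustration.

```sh
text='The quick brown fox jumps over the lazy dog'

# fold cuts at exactly column 18, even in mid-word
folded=$(printf '%s\n' "$text" | fold -w 18)

# fmt breaks only at whitespace, so every word survives intact
filled=$(printf '%s\n' "$text" | fmt -w 18)

printf 'fold:\n%s\n\nfmt:\n%s\n' "$folded" "$filled"
```

The first line that fold produces ends in "fo", with the "x" of "fox" pushed onto the next line; fmt never splits a word that way.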

You can use fmt for things like neatening lines of a mail message or a file that you're editing with vi (Section 17.28). (Emacs has its own built-in line-neatener.) It's also great for shell programming and almost any place you have lines that are too long or too short for your screen.

To make this discussion more concrete, let's imagine that you have the following paragraph:

    Most people take their  Emo Phillips  for granted.  They figure, and not
without some truth, that he is a God-given right and any government that
considers   itself a democracy would naturally provide
its citizens with this
sort of access.  But what too many of this  Gap-wearing,
Real World-watching generation  fail to realize
is that our American
forefathers, under  the  tutelage of Zog, the wizened master sage from
Zeta-Reticuli, had to fight  not only   the godless and   effete British
for our system of  self-determined government, but also  avoid the  terrors
of hynpo-death  from the dark and
unclean Draco-Repitilians.

To prepare this text for printing, you'd like to have all the lines be about 60 characters wide and remove the extra space in the lines. Although you could format this text by hand, GNU fmt can do this for you with the following command line:

% fmt -tuw 60 my_file

The -t option, short for --tagged-paragraph mode, tells fmt to preserve the paragraph's initial indent but align the rest of the lines with the left margin of the second line. The -u option, short for --uniform-spacing, squashes all the inappropriate whitespace in the lines. The final option, -w, sets the width of the output in characters. Like most UNIX commands, fmt sends its output to stdout. For our test paragraph, fmt did this:

    Most people take their Emo Phillips for granted.
They figure, and not without some truth, that he is a
God-given right and any government that considers itself a
democracy would naturally provide its citizens with this
sort of access.  But what too many of this Gap-wearing,
Real World-watching generation fail to realize is that
our American forefathers, under the tutelage of Zog,
the wizened master sage from Zeta-Reticuli, had to fight
not only the godless and effete British for our system of
self-determined government, but also avoid the terrors of
hynpo-death from the dark and unclean Draco-Repitilians.

There is one subtlety to fmt to be aware of: fmt expects sentences to end with a period, question mark, or exclamation point followed by two spaces. If your document isn't marked up according to this convention, fmt can't differentiate between sentences and abbreviations. This is a common "gotcha" that appears frequently on Usenet.

On at least one version of Unix, fmt is a disk initializer (disk formatter) command. Don't run that command accidentally! Check your online manual page and see the fmt equivalents that follow.

There are a few different versions of fmt, some fancier than others. In general, the program assumes the following:

Go to http://examples.oreilly.com/upt3 for more information on: fmt

The GNU fmt is on the CD-ROM [see http://examples.oreilly.com/upt3]. There are also a couple of free versions available. Many versions of fmt have options for other structured data. The -p option (Section 21.4) reformats program source code. (If your fmt doesn't have -p, the recomment (Section 21.4) script uses standard fmt with sed to do the same thing.) The -s option breaks long lines at whitespace but doesn't join short lines to form longer ones.

Alternatively, you can make your own (Section 21.3) simple (and a little slower) version with sed and nroff. If you want to get fancy (and use some nroff and/or tbl coding), this will let you do automatically formatted text tables, bulleted lists, and much more.

—JP, TOR, and JJ

21.3 Alternatives to fmt

fmt (Section 21.2) is hard to do without once you've learned about it. Unfortunately, it's not available in some versions of Unix. You can get the GNU version from the CD-ROM [see http://examples.oreilly.com/upt3], but it's also relatively easy to emulate with sed (Section 37.4) and nroff. Using those two utilities also lets you take advantage of the more sophisticated formatting and flexibility that sed and nroff macros can give you. (If you're doing anything really fancy, like tables with tbl,[1] you might need col or colcrt to clean up nroff's output.)

Here's the script:

Go to http://examples.oreilly.com/upt3 for more information on: fmt.sh

#!/bin/sh
sed '1i\
.ll 72\
.na\
.hy 0\
.pl 1' "$@" | nroff

The reason this is so complicated is that, by default, nroff makes some assumptions you need to change. For example, it assumes an 11-inch page (66 lines) and will add blank lines to a short file (or the end of a long file). The quick-and-dirty workaround to this is to manually put the nroff request .pl 1 (page length 1 line) at the top of the text you want to reformat. nroff also tends to justify lines; you want to turn this off with the .na request. You also want to turn off hyphenation (.hy 0), and you may want to set the line length to 72 instead of nroff's default 65, if only for consistency with the real fmt program. All these nroff requests get inserted before the first line of input by the sed 1i command.

A fancier script would take a -nn line-length option and turn it into a .ll request for nroff, etc.
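That fancier script might start out like this. Treat it as a sketch under stated assumptions: the function name nroff_requests and its -NN handling are illustrative and are not part of the fmt.sh on the web site, and it assumes a POSIX shell plus a sed that accepts the same 1i\ syntax the script above uses. The function stops short of calling nroff so that you can see exactly what would be fed to it.

```sh
# nroff_requests [-NN] [file ...]
# Print the nroff requests followed by the text; -NN sets the line
# length (default 72, the value fmt.sh hard-codes).
nroff_requests() {
    width=72
    case "${1-}" in
    -[0-9]*) width=${1#-}; shift ;;   # "-60" becomes "60"
    esac
    sed "1i\\
.ll $width\\
.na\\
.hy 0\\
.pl 1" "$@"
}

# Typical use: nroff_requests -60 draft.txt | nroff
printf 'some text to format\n' | nroff_requests -60
```

Keeping the request-building step separate from nroff makes it easy to add further options later (a different page length, say) without touching the rest of the pipeline.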

Another solution to consider is Damian Conway's Text::Autoformat Perl module. It has some very sophisticated heuristics to try to figure out how text should be formatted, including bulleted and numbered lists. In its simplest form, it can be used to read from stdin and write to stdout, just as a standard Unix utility would do. You can invoke this module from the command line like this:

% perl -MText::Autoformat -e 'autoformat' < your_file_here

By default, autoformat formats only one paragraph at a time. This behavior can be changed by altering the invocation slightly:

% perl -MText::Autoformat -e 'autoformat({all =>1})'

The manpage for this module even suggests a way to integrate this into vi:

map f !Gperl -MText::Autoformat -e'autoformat'

—TOR and JJ

21.4 Clean Up Program Comment Blocks

Lines in a program's comment block usually start with one or more special characters, like this:

# line 1 of the comment
# line 2 of the comment
# line 3 of the comment
    ...

It can be a hassle to add more text to one of the comment lines in a block, because the line can get too long, which requires you to fold that line onto the next line, which means you have to work around the leading comment character(s).

The fmt (Section 21.2) program neatens lines of a text file. But the standard fmt won't help you "neaten" blocks of comments in a program: it mixes the comment characters from the starts of lines with the words. (If your fmt has the -p option, it handles this problem; there's an example below.) The recomment script is fmt for comment blocks. It's for people who write shell, awk, C, or almost any other kind of program with comment blocks several lines long.

21.4.1 The recomment Script

Go to http://examples.oreilly.com/upt3 for more information on: recomment

recomment reads the lines that you feed its standard input. It looks at the first line and figures out what characters you're using to comment the line (see the $cchars variable for a list — typically SPACEs, TABs, #, or *). recomment then strips those comment characters off each line, feeds the remaining block of text to the fmt utility, and uses sed (Section 34.1) to add the comment characters again.

I usually use recomment from inside vi, with filter-through (Section 17.18) commands like:

!}recomment  reformat to the next blank line
5!!recomment reformat this line and the next 4

Normally, recomment lets fmt choose the width of the comment block (72 characters, typically). To get another width, you can do one of the following:

recomment isn't perfect, but it's usually much better than nothing! Here's the part of the script that does the work. The first two commands get the comment character(s) and count their length. The next three commands strip the comment characters, clean up the remaining comment text, and add the same comment characters to the start of all reformatted lines:

-n Section 34.3, expr Section 36.22, cut Section 21.14

# Get comment characters used on first line; store in $comment:
comment=`sed -n "1s/^\([$cchars]*\).*/\1/p" $temp`
# Count number of characters in comment character string:
cwidth=`expr "$comment" : '.*'`

# Re-format the comment block.  If $widopt set, use it:
cut -c`expr $cwidth + 1`- < $temp |     # Strip off comment leader
fmt $widopt |                           # Re-format the text, and
sed "s/^/$comment/"                     # put back comment characters

When the expr command in backquotes (Section 28.14) is expanded, it makes a command like cut -c4-.
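To experiment with that pipeline without installing the whole script, you can wrap the same three steps in a function. This is a cut-down sketch, not the real recomment: the name recomment_demo is made up, $cchars is set to an assumed list (blanks, # and *), there is no width option, and it uses POSIX shell features such as ${#comment} in place of expr.

```sh
# recomment_demo: read a comment block on stdin, refill it, write to stdout
recomment_demo() {
    cchars='[:blank:]#*'          # assumed comment leaders: SPACE, TAB, #, *
    temp=$(mktemp) || return 1
    cat > "$temp"

    # Same idea as the script: grab the leader from line 1, count its length
    comment=$(sed -n "1s/^\([$cchars]*\).*/\1/p" "$temp")
    cwidth=${#comment}

    # Strip the leader, refill the text, put the leader back
    cut -c"$((cwidth + 1))"- < "$temp" | fmt | sed "s/^/$comment/"
    rm -f "$temp"
}

printf '%s\n' '# a short line' '# followed by' '# two more' | recomment_demo
```

The three short comment lines come back as the single line "# a short line followed by two more". A leader that contains a slash (// in C++, for instance) would break the final sed command, which is one reason to call this a sketch.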

21.4.2 fmt -p

Some versions of fmt (like the one on the CD-ROM [see http://examples.oreilly.com/upt3]) have a -p option that does the same thing. Unlike the automatic system in recomment, you have to tell fmt -p what the prefix characters are, but then it will reformat only the lines that begin with that prefix. For example, here's the start of a C++ program. The prefix character is *:

% cat load.cc
/*
 * This file, load.cc, reads an input
 * data file.
 * Each input line is added to a new node
 * of type struct Node.
 */
    ...
% fmt -p '*' load.cc
/*
 * This file, load.cc, reads an input data file.  Each input line is
 * added to a new node of type struct Node.
 */
    ...

— JP

21.5 Remove Mail/News Headers with behead

When you're saving or resending a Usenet article or mail message, you might want to remove the header lines (Subject:, Received:, and so on). This little script handles standard input or one or more files, and it writes to standard output. Here's an example:

mail Section 1.21

% behead msg* | mail -s "Did you see these?" fredf

Here's the script, adapted a little from the original by Arthur David Olson:

Go to http://examples.oreilly.com/upt3 for more information on: behead

#! /bin/sh
case $# in
0)  exec sed '1,/^$/d' ;;
*)  for i
    do sed '1,/^$/d' "$i"
    done
    ;;
esac

The script relies on the fact that news articles and mail messages use a blank line to separate the header from the body of the message. As a result, the script simply deletes the text from the beginning up to the first blank line.
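You can watch the sed range do its work on a made-up message. In this sketch the function simply wraps the script's sed command, and the addresses and text are invented for the demonstration.

```sh
# Same sed command as the script, wrapped in a function for experimenting
behead() { sed '1,/^$/d' "$@"; }

printf '%s\n' \
    'From: someone@example.com' \
    'Subject: lunch' \
    '' \
    'Noon at the usual place?' \
    '' \
    'See you there.' | behead
```

Only the header and the first blank line disappear. The blank line inside the body survives, because the range 1,/^$/ ends at the first match and never starts again.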

— JP

21.6 Low-Level File Butchery with dd

Want to strip off some arbitrary number of characters from the front of a file?

Go to http://examples.oreilly.com/upt3 for more information on: dd

dd provides an unexpectedly easy answer. Let's say you wanted to delete the first 100 characters in a file. Here's the command that will do the trick (assuming of course that you give dd a filename with the if= option or data from a pipe):

% dd bs=100 skip=1

Or you could try:

% dd bs=1 skip=100

dd normally reads and writes data in 512-byte blocks; the input block size can be changed with the ibs= option, and the output block size with obs=. Use bs= to set both. skip= sets the number of blocks to skip at the start of the file.
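Here is a small, self-contained way to convince yourself that the two forms are equivalent. It's a sketch that assumes a POSIX shell with mktemp available; the 26-byte test file and the names alphabet, tail1, and tail2 are made up.

```sh
work=$(mktemp -d)
printf 'abcdefghijklmnopqrstuvwxyz' > "$work/alphabet"

# Skip one 10-byte block...
dd if="$work/alphabet" bs=10 skip=1 2>/dev/null > "$work/tail1"
# ...or ten 1-byte blocks; same result, just more read and write calls
dd if="$work/alphabet" bs=1 skip=10 2>/dev/null > "$work/tail2"

cat "$work/tail1"; echo        # klmnopqrstuvwxyz
cmp "$work/tail1" "$work/tail2" && echo "both commands give the same output"
```

The 2>/dev/null throws away the record counts that dd prints on standard error. Remove the directory named in $work when you're done.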

Why would you want to do this? Section 21.9 gives an interesting example of reading text from standard input and writing it to a series of smaller files. Section 21.13 shows even more uses for dd.

— TOR

21.7 offset: Indent Text

Do you have a printer that starts each line too close to the left margin? You might want to indent text to make it look better on the screen or a printed page. Here's a Perl script that does that. It reads from files or standard input and writes to standard output. The default indentation is 5 spaces. For example, to send a copy of a file named graph to the lp printer, indented 12 spaces:

% offset -12 graph | lp

Here's the Perl script that does the job:

Go to http://examples.oreilly.com/upt3 for more information on: offset

#!/usr/local/bin/perl

if (@ARGV && $ARGV[0] =~ /^-[0-9]+$/) {
    ($indent = $ARGV[0]) =~ s/-//;
    shift @ARGV;
} else {
    $indent = 5;
}

while (<>) {
    print " " x $indent, $_;
}

If there's an indentation amount in the first command-line argument, the dash is stripped and the value stored, then that argument is shifted away. Then a loop steps through the remaining arguments, if any (otherwise standard input is read) and outputs their text preceded by spaces. The script uses the Perl operator "string" x n, which repeats the string (in this case, a single space) n times. The Perl variable $_ holds the current input line.
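For comparison, the same idea fits in a few lines of shell. This is a sketch of an equivalent, not a replacement for the Perl script: it assumes a POSIX shell, and it builds the padding with printf and lets sed prepend it to every line.

```sh
# offset [-N] [file ...]: indent each line by N spaces (default 5)
offset() {
    indent=5
    case "${1-}" in
    -[0-9]*) indent=${1#-}; shift ;;   # "-12" becomes "12"
    esac
    pad=$(printf "%${indent}s" '')     # a string of $indent spaces
    sed "s/^/$pad/" "$@"
}

printf 'first\nsecond\n' | offset -3
```

Both lines come out with three leading spaces. As in the Perl version, any remaining arguments are treated as filenames, and standard input is read when there are none.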

— JP

21.8 Centering Lines in a File

Here's an awk script, written by Greg Ubben, that centers text across an 80-character line. If your system understands #! (Section 36.3), this script will be passed directly to awk without a shell. Otherwise, put this into a Bourne shell wrapper (Section 35.19).

Go to http://examples.oreilly.com/upt3 for more information on: center

#!/usr/bin/awk -f
{
    printf "%" int(40+length($0)/2) "s\n", $0
}

For each input line, the script builds a printf command with a width specification just wide enough to center the line (which awk holds in $0). For instance, a line 60 characters wide would give a value of int(40+60/2), which is 70. That makes the following printf command:

printf "%70s\n", $0

Because %s prints a string right-justified, that command gives a 10-character indentation (70 minus 60) on an 80-character line. The right end of the line is also 10 characters (80 minus 70) from the right edge of the screen.
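You can check that arithmetic from the shell. In this sketch the awk program is wrapped in a shell function (one form the Bourne shell wrapper mentioned earlier could take) and fed a 60-character test line made of x's; the width is computed into a variable w purely for readability.

```sh
# center: the awk program wrapped in a shell function
center() {
    awk '{ w = int(40 + length($0) / 2); printf "%" w "s\n", $0 }' "$@"
}

line=$(printf '%60s' '' | tr ' ' 'x')        # a 60-character test line

# Total width of the centered line: 70
printf '%s\n' "$line" | center | awk '{ print length($0) }'

# Leading spaces only (everything from the first x on is removed): 10
printf '%s\n' "$line" | center | sed 's/x.*//' | awk '{ print length($0) }'
```

A line with an odd length can't be centered exactly on an even-width screen; int( ) rounds the width down, so the line sits half a column to the left.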

In vi, you can use a filter-through (Section 17.18) command to center lines while you're editing. Or just use center from the command line. For example:

% center afile > afile.centered
% sort party_list | center | lp

— JP

21.9 Splitting Files at Fixed Points: split

Most versions of Unix come with a program called split, which breaks a large file into smaller ones. That's handy when your editor can't handle large files, or when a file is so big that some mailers refuse to deal with it. For example, let's say you have a really big text file that you want to mail to someone:

% ls -l bigfile
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile

Running split on that file will (by default, with most versions of split) break it up into pieces that are each no more than 1000 lines long:

wc Section 16.6

% split bigfile
% ls -l
total 283
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         46444 Oct 15 21:04 xaa
-rw-rw-r--  1 jik         51619 Oct 15 21:04 xab
-rw-rw-r--  1 jik         41007 Oct 15 21:04 xac
% wc -l x*
    1000 xaa
    1000 xab
     932 xac
    2932 total

Note the default naming scheme, which is to append "aa", "ab", "ac", etc., to the letter "x" for each subsequent filename. It is possible to modify the default behavior. For example, you can make split create files that are 1500 lines long instead of 1000:

% rm x??
% split -1500 bigfile
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:06 xaa
-rw-rw-r--  1 jik         65054 Oct 15 21:06 xab

You can also get it to use a name prefix other than "x":

% rm x??
% split -1500 bigfile bigfile.split.
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:07 bigfile.split.aa
-rw-rw-r--  1 jik         65054 Oct 15 21:07 bigfile.split.ab
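If you'd like to try this without a big file of your own, the following sketch builds a 2932-line file in a scratch directory and splits it. It assumes a POSIX shell and a split that accepts -l (the POSIX spelling of the line count; the older -1000 form shown above does the same thing where it's supported). The part. prefix is arbitrary.

```sh
work=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line", i }' > "$work/bigfile"

# -l 1000 means 1000 lines per piece; the last argument is the name prefix
split -l 1000 "$work/bigfile" "$work/part."

wc -l "$work"/part.*

# Putting the pieces back together should reproduce the original
cat "$work"/part.* | cmp - "$work/bigfile" && echo "pieces match the original"
```

Because the suffixes aa, ab, ac sort alphabetically, the shell's wildcard expansion hands the pieces to cat in the right order.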

Although the simple behavior described above tends to be relatively universal, there are differences in the functionality of split on different Unix systems. There are four basic variants of split as shipped with various implementations of Unix:

  1. A split that understands only how to deal with splitting text files into chunks of n lines or less each.

  2. A split, usually called bsplit, that understands only how to deal with splitting nontext files into n-character chunks.

  3. A split that splits text files into n-line chunks, or nontext files into n-character chunks, and tries to figure out automatically whether it's working on a text file or a nontext file.

  4. A split that does either text files or nontext files but needs to be told explicitly when it is working on a nontext file.

The only way to tell which version you've got is to read the manual page for it on your system, which will also tell you the exact syntax for using it.

The problem with the third variant is that although it tries to be smart and automatically do the right thing with both text and nontext files, it sometimes guesses wrong and splits a text file as a nontext file or vice versa, with completely unsatisfactory results. Therefore, if the variant on your system is (3), you probably want to get your hands on one of the many split clones out there that is closer to one of the other variants (see below).

Variants (1) and (2) listed above are OK as far as they go, but they aren't adequate if your environment provides only one of them rather than both. If you find yourself needing to split a nontext file when you have only a text split, or needing to split a text file when you have only bsplit, you need to get one of the clones that will perform the function you need.

Go to http://examples.oreilly.com/upt3 for more information on: split

Variant (4) is the most reliable and versatile of the four listed, and it is therefore what you should go with if you find it necessary to get a clone and install it on your system. There are several such clones in the various source archives, including the free BSD Unix version. GNU split is on the CD-ROM [see http://examples.oreilly.com/upt3]. Alternatively, if you have installed perl (Section 41.1), it is quite easy to write a simple split clone in perl, and you don't have to worry about compiling a C program to do it; this is an especially significant advantage if you need to run your split on multiple architectures that would need separate binaries. The Perl code for a binary split program follows:

#!/usr/bin/perl -w --
# Split text or binary files; jjohn 2/2002
use strict;
use Getopt::Std;

my %opts;
getopts("?b:f:hp:ts:", \%opts);

if( $opts{'?'} || $opts{'h'} || !-e $opts{'f'}){
  print <<USAGE;
$0 - split files in smaller ones

USAGE:
    $0 -b 1500 -f big_file -p part.

OPTIONS:

    -?       print this screen
    -h       print this screen
    -b <INT> split file into given byte size parts
    -f <TXT> the file to be split
    -p <TXT> each new file to begin with given text
    -s <INT> split file into given number of parts
USAGE
   exit;
}

my $infile;
open($infile, $opts{'f'}) or die "No file given to split\n";
binmode($infile);
my $infile_size = (stat $opts{'f'})[7];

my $block_size = 1;
if( $block_size = $opts{'b'} ){
  # chunk file into blocks

}elsif( my $total_parts = $opts{'s'} ){
  # chunk file into N parts
  $block_size = int ( $infile_size / $total_parts) + 1;

}else{
  die "Please indicate how to split file with -b or -s\n";
}

my $outfile_base = $opts{'p'} || 'part.';
my $outfile_ext = "aa";

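my $offset = 0;
while( $offset < $infile_size ){
  my $buf;
  my $got = read $infile, $buf, $block_size;
  die "Error reading $opts{'f'}: $!\n" unless defined $got;
  last unless $got;   # unexpected end of file; don't loop forever
  $offset += $got;
  write_file($outfile_base, $outfile_ext++, \$buf);
}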

#--- subs ---#
sub write_file {
  my($fname, $ext, $buf) = @_;

  my $outfile;
  open($outfile, ">$fname$ext") or die "can't open $fname$ext\n";
  binmode($outfile);
  my $wrote = syswrite $outfile, $$buf;
  my $size  = length($$buf);
  warn "WARN: wrote $wrote bytes instead of $size to $fname$ext\n"
    unless $wrote == $size;
}

Although it may seem somewhat complex at first glance, this small Perl script is cross-platform and has its own small help screen to describe its options. Briefly, it can split files into N-sized blocks (given the -b option) or, with -s, create N new segments of the original file. For a better introduction to Perl, see Chapter 42.

If you need to split a nontext file and don't feel like going to all of the trouble of finding a split clone to handle it, one standard Unix tool you can use to do the splitting is dd (Section 21.6). For example, if bigfile above were a nontext file and you wanted to split it into 20,000-byte pieces, you could do something like this:

for Section 35.21, > Section 28.12

$ ls -l bigfile
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
$ for i in 1 2 3 4 5 6 7   #[2]
> do
>       dd of=x$i bs=20000 count=1 2>/dev/null  #[3]
> done < bigfile
$ ls -l
total 279
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x1
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x2
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x3
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x4
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x5
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x6
-rw-rw-r--  1 jik         19070 Oct 23 09:00 x7
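The loop above works only because someone counted the seven pieces by hand. A sketch that does the counting for you might look like this; split_bytes is a made-up name, the function assumes a POSIX shell (for the $(( )) arithmetic), and the 70-byte test file stands in for bigfile.

```sh
# split_bytes FILE SIZE PREFIX: write PREFIX1, PREFIX2, ... of SIZE bytes each
split_bytes() {
    file=$1 size=$2 prefix=$3
    total=$(wc -c < "$file")
    pieces=$(( (total + size - 1) / size ))      # round up
    i=1
    while [ "$i" -le "$pieces" ]; do
        # Every dd shares the loop's standard input, so each one
        # picks up where the previous one stopped reading.
        dd of="$prefix$i" bs="$size" count=1 2>/dev/null
        i=$((i + 1))
    done < "$file"
}

work=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 70; i++) printf "x" }' > "$work/bigfile"   # 70 bytes
split_bytes "$work/bigfile" 20 "$work/x"
ls "$work"       # lists bigfile, x1, x2, x3 and x4
```

With ten or more pieces, x10 sorts before x2, so name the pieces explicitly (or pad the numbers) when you cat them back together.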

—JIK and JJ

21.10 Splitting Files by Context: csplit

Go to http://examples.oreilly.com/upt3 for more information on: csplit

Like split (Section 21.9), csplit lets you break a file into smaller pieces, but csplit (context split) also allows the file to be broken into different-sized pieces, according to context. With csplit, you give the locations (line numbers or search patterns) at which to break each section. csplit comes with System V, but there are also free versions available.

Let's look at search patterns first. Suppose you have an outline consisting of three main sections that start on lines with the Roman numerals I., II., and III.. You could create a separate file for each section by typing:

% csplit outline /I./ /II./ /III./ 
28       number of characters in each file
415                   .
372                   .
554                   .
% ls 
outline
xx00      outline title, etc.
xx01      Section I
xx02      Section II
xx03      Section III

This command creates four new files (outline remains intact). csplit displays the character counts for each file. Note that the first file (xx00) contains any text up to but not including the first pattern, and xx01 contains the first section, as you'd expect. This is why the naming scheme begins with 00. (If outline had begun immediately with a I., xx01 would still contain Section I, but in this case xx00 would be empty.)

If you don't want to save the text that occurs before a specified pattern, use a percent sign as the pattern delimiter:

% csplit outline %I.% /II./ /III./ 
415
372
554
% ls 
outline
xx00          Section I
xx01          Section II
xx02          Section III

The preliminary text file has been suppressed, and the created files now begin where the actual outline starts (the file numbering is off, however).

Let's make some further refinements. We'll use the -s option to suppress the display of the character counts, and we'll use the -f option to specify a file prefix other than the conventional xx:

% csplit -s -f part. outline /I./ /II./ /III./
% ls
outline
part.00
part.01
part.02
part.03

There's still a slight problem, though. In search patterns, a period is a metacharacter (Section 32.21) that matches any single character, so the pattern /I./ may inadvertently match words like Introduction. We need to escape the period with a backslash; however, the backslash has meaning both to the pattern and to the shell, so in fact, we need either to use a double backslash or to surround the pattern in quotes (Section 27.12). A subtlety, yes, but one that can drive you crazy if you don't remember it. Our command line becomes:

% csplit -s -f part. outline "/I\./" /II./ /III./
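Here's a complete run you can paste into a shell to see the escaped patterns at work. It's a sketch with a made-up six-line outline; it goes a step further than the command above by escaping the period in all three patterns and anchoring each with ^, and it uses single quotes so that the backslashes reach csplit untouched.

```sh
work=$(mktemp -d)
printf '%s\n' 'Outline title' 'I. Introduction' 'some text' \
    'II. Internals' 'more text' 'III. Index' > "$work/outline"

# ^ and \. together make each pattern match only a real section heading
csplit -s -f "$work/part." "$work/outline" '/^I\./' '/^II\./' '/^III\./'

# Show the first line of each section file
head -n 1 "$work/part.01" "$work/part.02" "$work/part.03"
```

part.00 holds just the title line, and each of the other three files starts with its own Roman numeral.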

You can also break a file at repeated occurrences of the same pattern. Let's say you have a file that describes 50 ways to cook a chicken, and you want each method stored in a separate file. The sections begin with headings WAY #1, WAY #2, and so on. To divide the file, use csplit's repeat argument:

% csplit -s -f cook. fifty_ways /^WAY/ "{49}"

This command splits the file at the first occurrence of WAY, and the number in braces tells csplit to repeat the split 49 more times. Note that a caret (^) (Section 32.5) is used to match the beginning of the line and the C shell requires quotes around the braces (Section 28.4). The command has created 50 files:

% ls cook.*
cook.00
cook.01
  ...
cook.48
cook.49

Quite often, when you want to split a file repeatedly, you don't know or don't care how many files will be created; you just want to make sure that the necessary number of splits takes place. In this case, it makes sense to specify a repeat count that is slightly higher than what you need (the maximum is 99). Unfortunately, if you tell csplit to create more files than it's able to, this produces an "out of range" error. Furthermore, when csplit encounters an error, it exits by removing any files it created along the way. (A bug, if you ask me.) This is where the -k option comes in. Specify -k to keep the files around, even when the "out of range" message occurs.

csplit allows you to break a file at some number of lines above or below a given search pattern. For example, to break a file at the line that is five lines below the one containing Sincerely, you could type:

% csplit -s -f letter. all_letters /Sincerely/+5

This situation might arise if you have a series of business letters strung together in one file. Each letter begins differently, but each one begins five lines after the previous letter's Sincerely line. Here's another example, adapted from AT&T's Unix User's Reference Manual:

% csplit -s -k -f routine. prog.c '%main(%' '/^}/+1' '{99}'

The idea is that the file prog.c contains a group of C routines, and we want to place each one in a separate file (routine.00, routine.01, etc.). The first pattern uses % because we want to discard anything before main. The next argument says, "Look for a closing brace at the beginning of a line (the conventional end of a routine) and split on the following line (the assumed beginning of the next routine)." Repeat this split up to 99 times, using -k to preserve the created files.[4]

The csplit command takes line-number arguments in addition to patterns. You can say:

% csplit stuff 50 373 955

to create files split at some arbitrary line numbers. In that example, the new file xx00 will have lines 1-49 (49 lines total), xx01 will have lines 50-372 (323 lines total), xx02 will have lines 373-954 (582 lines total), and xx03 will hold the rest of stuff.
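Those counts are easy to verify. The sketch below assumes a 1000-line file (the text doesn't say how long stuff is, so the length is an assumption) in which each line holds its own line number.

```sh
work=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$work/stuff"

csplit -s -f "$work/xx" "$work/stuff" 50 373 955

wc -l "$work/xx00" "$work/xx01" "$work/xx02" "$work/xx03"
head -n 1 "$work/xx01"     # 50: each line number starts a new file
```

wc reports 49, 323, and 582 lines for the first three files, and 46 for the last (lines 955 through 1000 of this test file).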

csplit works like split if you repeat the argument. The command:

% csplit top_ten_list 10 "{18}"

breaks the list into 19 segments of 10 lines each.[5]

— DG

21.11 Hacking on Characters with tr

The tr command is a character translation filter, reading standard input (Section 43.1) and either deleting specific characters or substituting one character for another.

The most common use of tr is to change each character in one string to the corresponding character in a second string. (A string of consecutive ASCII characters can be represented as a hyphen-separated range.)

For example, the command:

< Section 43.1

$ tr 'A-Z' 'a-z' <  file          Berkeley version

will convert all uppercase characters in file to the equivalent lowercase characters. The result is printed on standard output.

In fact, a frequent trick I use tr for is to convert filenames from all uppercase to all lowercase. This comes up when you're dealing with files from MS-DOS or VMS that you are copying onto a Unix filesystem. To change all the filenames in the current directory to lowercase, try this from a Bash or Bourne shell prompt:

$ for i in `ls`; do mv $i `echo $i | tr '[A-Z]' '[a-z]'`; done

Of course, you need to be careful that there are no files that have the same name regardless of case. The GNU mv can be passed the -i flag that will make the program prompt you before overwriting an existing file. If you want to uppercase filenames, simply flip the arguments to tr. You can even apply this to an entire branch of a file system by sticking this in a find command. First, create a small shell script that can downcase a file and call it downcase:

#!/bin/sh
mv $1 `echo $1 | tr '[A-Z]' '[a-z]'`

Now you can really do some damage with find:

$ find /directory/to/be/affected -exec 'downcase' '{}' ';'

Obviously, running this program on random directories as root is not recommended, unless you're looking to test your backup system.
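If you want something a little less dangerous, here is one way a more careful downcase might look. It is a sketch, assuming a POSIX shell: it lowercases only the last component of each pathname (the two-line script lowercases the whole path that find hands it, directory names included), quotes its arguments, and refuses to overwrite an existing file.

```sh
# downcase PATH ...: rename each file so its basename is all lowercase
downcase() {
    for path in "$@"; do
        dir=$(dirname "$path")
        old=$(basename "$path")
        new=$(printf '%s\n' "$old" | tr '[A-Z]' '[a-z]')
        [ "$old" = "$new" ] && continue          # already lowercase
        if [ -e "$dir/$new" ]; then
            echo "downcase: $dir/$new exists; skipping $path" >&2
        else
            mv "$path" "$dir/$new"
        fi
    done
}

work=$(mktemp -d)
: > "$work/README.TXT"
downcase "$work/README.TXT"
ls "$work"          # readme.txt
```

To use it with find, save the body of the function as a script. Adding find's -depth option makes find visit a directory's contents before the directory itself, so a directory is not renamed while find still needs its old name.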

Go to http://examples.oreilly.com/upt3 for more information on: tr

In the System V version of tr, square brackets must surround any range of characters. That is, you have to say [a-z] instead of simply a-z. And of course, because square brackets are meaningful to the shell, you must protect them from interpretation by putting the string in quotes. The GNU tr, on the web site, is basically the System V version.

If you aren't sure which version you have, here's a test. Both tr examples below will convert the lowercase letters a through z to an uppercase A, but that's not what we're testing here. The Berkeley version also converts the input [ ] to A characters because [ ] aren't treated as range operators:

% echo '[ ]' | tr '[a-z]' A 
AA                                Berkeley version
% echo '[ ]' | tr '[a-z]' A 
[ ]                                System V version

There's one place you don't have to worry about the difference between the two versions: when you're converting one range to another range, and both ranges have the same number of characters. For example, this command works in both versions:

$ tr '[A-Z]' '[a-z]' < file          both versions

The Berkeley tr will convert a [ from the first string into the same character [ in the second string, and the same for the ] characters. The System V version uses the [ ] characters as range operators. In both versions, you get what you want: the range A-Z is converted to the corresponding range a-z. Again, this trick works only when both ranges have the same number of characters.
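A pair of tiny functions makes the portable form easy to try. The names lower and upper are made up for this sketch, which assumes a POSIX shell.

```sh
# The bracketed, equal-length form behaves the same way in either flavor of tr
lower() { tr '[A-Z]' '[a-z]'; }
upper() { tr '[a-z]' '[A-Z]'; }

echo 'Unix Power Tools' | lower     # unix power tools
echo 'Unix Power Tools' | upper     # UNIX POWER TOOLS
echo 'array[INDEX]' | lower         # array[index]
```

In the last example the square brackets in the input come through unchanged, whichever way your tr interprets the brackets in its arguments.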

The System V version also has a nice feature: the syntax [a*n], where n is some digit, means that the string should consist of n repetitions of character "a." If n isn't specified or is 0, it is taken to be some indefinitely large number. This is useful if you don't know how many characters might be included in the first string.

As described in Section 17.18, this translation (and the reverse) can be useful from within vi for translating a string. You can also delete specific characters. The -d option deletes from the input each occurrence of one or more characters specified in a string (special characters should be placed within quotation marks to protect them from the shell). For instance, the following command passes to standard output the contents of file with all punctuation deleted (and is a great exercise in shell quoting (Section 27.12)):

$ tr -d ",.\!?;:\"\'\`" < file
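If the quoting gets to be too much, a tr that supports POSIX character classes offers a shortcut. This is a swapped-in alternative, not the command above: [:punct:] covers every punctuation character, which is a larger set than the one listed. The sample sentence is invented:

```shell
# Delete all punctuation using the POSIX character class.
# Single quotes keep the shell from touching the brackets.
printf 'Wait, what?! "Really"; yes.\n' | tr -d '[:punct:]'
```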

The -s (squeeze) option of tr removes multiple consecutive occurrences of the same character in the second argument. For example, the command:

$ tr -s " " " " < file 

will print on standard output a copy of file in which multiple spaces in sequence have been replaced with a single space.

We've also found tr useful when converting documents created on other systems for use under Unix. For example, as described in Section 1.8, tr can be used to change the carriage returns at the end of each line in a Macintosh text file into the newline Unix expects. tr allows you to specify characters as octal values by preceding the value with a backslash, so the following command does the trick:

$ tr '\015' '\012' < file.mac > file.unix

The command:

$  tr -d '\015' < pc.file

will remove the carriage return from the carriage return/newline pair that a PC file uses as a line terminator. (This command is also handy for removing the excess carriage returns from a file created with script (Section 37.7).)
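Here's a sketch of both conversions on made-up input, so you can try them without hunting for a Macintosh or PC file. Piping through od -c is only a convenient way to make stray carriage returns visible:

```shell
# Old Macintosh text: each line ends in a bare carriage return (octal 015).
printf 'one\rtwo\r' | tr '\015' '\012'

# PC text: each line ends in carriage return + newline; drop the returns.
printf 'one\r\ntwo\r\n' | tr -d '\015'

# od -c shows a carriage return as \r, which helps when checking a file.
printf 'one\r\ntwo\r\n' | od -c | head -n 2
```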

—TOR, JP, and JJ

21.12 Encoding "Binary" Files into ASCII

Email transport systems were originally designed to transmit characters with a seven-bit encoding, like ASCII. This meant they could send messages with plain English text but not "binary" text, such as program files or graphics (or non-English text!), that used all of an eight-bit byte. Usenet (Section 1.21), the newsgroup system, was transmitted like email and had the same seven-bit limitations. The solution, which is still used today, is to encode eight-bit text into characters that use only the seven low bits.

The first popular solution on Unix-type systems was uuencoding. That method is mostly obsolete now (though you'll still find it used sometimes); it's been replaced by MIME encoding. The next two sections cover both of those — though we recommend avoiding uuencode like the plague.

21.12.1 uuencoding

The uuencode utility encodes eight-bit data into a seven-bit representation for sending via email or on Usenet. The recipient can use uudecode to restore the original data. Unfortunately, there are several different and incompatible versions of these two utilities. Also, uuencoded data doesn't travel well through all mail gateways — partly because uuencoding is sensitive to changes in whitespace (space and TAB) characters, and some gateways munge (change or corrupt) whitespace. So if you're encoding text for transmission, use MIME instead of uuencode whenever you can.

To create an ASCII version of a binary file, use the uuencode utility. For instance, a compressed file (Section 15.6) is definitely eight-bit; it needs encoding.

A uuencoded file (there's an example later in this article) starts with a begin line that gives the file's name; this name comes from the last argument you give the uuencode utility as it encodes a file. To make uuencode read a file directly, give that file's pathname as the first argument, ahead of the name for the begin line. uuencode writes the encoded file to its standard output. For example, to encode the file emacs.tar.gz from your ~/tarfiles directory and store it in a file named emacs.tar.gz.uu:

% uuencode ~/tarfiles/emacs.tar.gz emacs.tar.gz > emacs.tar.gz.uu

You can then insert emacs.tar.gz.uu into a mail message and send it to someone. Of course, the ASCII-only encoding takes more space than the original binary format. The encoded file will be about one-third larger.[6]

If you'd rather, you can combine the steps above into one pipeline. Given only one command-line argument (the name of the file for the begin line), uuencode will read its standard input. Instead of creating ~/tarfiles/emacs.tar.gz, making a second uuencoded file, then mailing that file, you can give tar the "filename" - (a hyphen) so it writes the archive to its standard output. That feeds the archive down the pipe:[7]

mail Section 1.21

% tar cf - emacs | gzip | uuencode emacs.tar.gz | \
    mail -s "uuencoded emacs file" whoever@wherever.com

What happens when you receive a uuencoded, compressed tar file? The same thing, in reverse. You'll get a mail message that looks something like this:

From: you@whichever.ie
To: whoever@wherever.com
Subject: uuencoded emacs file

begin 644 emacs.tar.gz
M+DQ0"D%L;"!O9B!T:&5S92!P<F]B;&5M<R!C86X@8F4@<V]L=F5D(&)Y(")L
M:6YK<RPB(&$@;65C:&%N:7-M('=H:6-H"F%L;&]W<R!A(&9I;&4@=&\@:&%V
M92!T=V\@;W(@;6]R92!N86UE<RX@(%5.25@@<')O=FED97,@='=O(&1I9F9E
M<F5N= IK:6YD<R!O9B!L:6YK<SH*+DQS($(*+DQI"EQF0DAA<F0@;&EN:W-<
   ...
end

So you save the message in a file, complete with headers. Let's say you call this file mailstuff. How do you get the original files back? Use the following sequence of commands:

% uudecode mailstuff
% gunzip emacs.tar.gz
% tar xf emacs.tar

The uudecode command searches through the file, skipping From:, etc., until it sees its special begin line; it decodes the rest of the file (until the corresponding end line) and creates the file emacs.tar.gz. Then gunzip recreates your original tar file, and tar xf extracts the individual files from the archive.

Again, though, you'll be better off using MIME encoding whenever you can.

21.12.2 MIME Encoding

When MIME (Multipurpose Internet Mail Extensions) was designed in the early 1990s, one main goal was robust email communications. That meant coming up with a mail encoding scheme that would work on all platforms and get through all mail transmission paths.

Some text is "mostly ASCII": for instance, it's in a language like German or French that uses many ASCII characters plus some eight-bit characters (characters with an octal value greater than 177). The MIME standard allows that text to be minimally encoded in a way that it can be read fairly well without decoding: the quoted-printable encoding. Other text is full binary: either not designed for humans to read, or so far from ASCII that an ASCII representation would be pointless. In that case, you'll want to use the base64 encoding.

Go to http://examples.oreilly.com/upt3 for more information on: mimencode, mailto

Most modern email programs automatically MIME-encode files. Unfortunately, some aren't too smart about it. The Metamail utilities come with a utility called mimencode (also named mmencode) for encoding and decoding MIME formats. Another Metamail utility, mailto, encodes and sends MIME messages directly — but let's use mimencode, partly because of the extra control it gives you.

By default, mimencode reads text from standard input, uses a base64 encoding, and writes the encoded text to standard output. If you add the -q option, mimencode uses quoted-printable encoding instead.
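If you don't have mimencode handy, you can still get a feel for base64. The sketch below assumes the base64 command from GNU coreutils, which is not part of Metamail; it stands in for mimencode (encode) and mimencode -u (decode) here:

```shell
# Encode a short string; "hello" plus a newline becomes eight characters.
printf 'hello\n' | base64          # prints aGVsbG8K

# Decode it again (-d is the GNU coreutils decode option).
printf 'aGVsbG8K\n' | base64 -d    # prints hello
```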

Unlike uuencoded messages, which contain the filename in the message body, MIME-encoded messages need information in the message header (the lines "To:", "From:", etc.). The mail utility (except an older version) doesn't let you make a message header. So let's do it directly: create a mail header with cat > (Section 11.2), create a mail body with mimencode, and send it using a common system mail transfer agent, sendmail. (You could automate this with a script, of course, but we're just demonstrating.) The MIME standard header formats are still evolving; we'll use a simple set of header fields that should do the job. Here's the setup. Let's do it first in three steps, using temporary files:

$ cat > header
From: jpeek@oreilly.com
To: jpeek@jpeek.com
Subject: base64-encoded smallfile
MIME-Version: 1.0
Content-Type: application/octet-stream; name="smallfile.tar.gz"
Content-Transfer-Encoding: base64

CTRL-d
$ tar cf - smallfile | gzip | mimencode > body
$ cat header body | /usr/lib/sendmail -t

The cat > command lets me create the header file by typing it in at the terminal; I could have used a text editor instead. One important note: the header must end with a blank line. The second command creates the body file. The third command uses cat to output the header, then the body; the message we've built is piped to sendmail, whose -t option tells it to read the addresses from the message header. You should get a message something like this:
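As for automating this, here is one rough sketch rather than a finished tool. The function name mime_message is hypothetical, and it assumes the GNU coreutils base64 command in place of mimencode. It only prints the message; nothing is sent unless you pipe the result to sendmail -t yourself:

```shell
# Print a complete message (header, blank line, base64 body) on stdout.
# Usage: mime_message FROM TO FILE      (hypothetical helper)
mime_message() {
    from=$1 to=$2 file=$3
    name=$(basename "$file")
    # The empty line before EOF is the blank line that must end the header.
    cat <<EOF
From: $from
To: $to
Subject: base64-encoded $name
MIME-Version: 1.0
Content-Type: application/octet-stream; name="$name"
Content-Transfer-Encoding: base64

EOF
    base64 < "$file"
}

# Example: build a message from a small scratch file and look at it.
tmp=$(mktemp)
printf 'sample data\n' > "$tmp"
mime_message jpeek@oreilly.com jpeek@jpeek.com "$tmp"
rm -f "$tmp"
```

To mail the result, you would pipe the function's output to /usr/lib/sendmail -t, just as in the third command above.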

Date: Wed, 22 Nov 2000 11:46:53 -0700
Message-Id: <200011221846.LAA18155@oreilly.com>
From: jpeek@oreilly.com
To: jpeek@jpeek.com
Subject: base64-encoded smallfile
MIME-Version: 1.0
Content-Type: application/octet-stream; name="smallfile.tar.gz"
Content-Transfer-Encoding: base64

H4sIACj6GzoAA+1Z21YbRxb1c39FWcvBMIMu3A0IBWxDzMTYDuBgrxU/lKSSVHF3V6erGiGv
rPn22edU3wRIecrMPLgfEGpVV53LPvtcOktcW6au3dnZ2mrZcfTkb7g6G53O7vb2k06ns7G3
06HPzt7uDn/Sra1N/L+32dnd29ve3tjD+s3Nna0novN3CHP/yqyTqRBPfk+U+rpknUnlf0Oc
  ...

Your mail client may be able to extract that file directly. You also can use mimencode -u. But mimencode doesn't know about mail headers, so you should strip off the header first. The behead (Section 21.5) script can do that. For instance, if you've saved the mail message in a file msg:

$ behead msg | mimencode -u > smallfile.tar.gz

Extract (Section 39.2) smallfile.tar.gz and compare it to your original smallfile (maybe with cmp). They should be identical.

If you're planning to do this often, it's important to understand how to form an email header and body properly. For more information, see relevant Internet RFCs (standards documents) and O'Reilly's Programming Internet Email by David Wood.

—JP and ML

21.13 Text Conversion with dd

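Besides the other uses of dd (Section 21.6) we've covered, you also can use this versatile utility to convert:

  • fixed-length records to variable-length records (conv=unblock), and the reverse (conv=block)

  • uppercase to lowercase (conv=lcase), and the reverse (conv=ucase)

  • the byte order of every pair of bytes (conv=swab)

  • ASCII to EBCDIC (conv=ebcdic or conv=ibm), and the reverse (conv=ascii)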

The cbs= option must be used to specify a conversion buffer size when using block and unblock and when converting between ASCII and EBCDIC. The specified number of characters are put into the conversion buffer. For ascii and unblock conversion, trailing blanks are trimmed and a newline is added to each buffer before it is output. For ebcdic, ibm, and block, the input is padded with blanks up to the specified conversion buffer size.
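A short sketch of block and unblock may make the conversion buffer easier to picture. The two-line input and the buffer size of 6 are arbitrary choices for illustration; 2>/dev/null hides dd's record counts:

```shell
# Pad each newline-terminated line with blanks to 6 bytes (conv=block),
# then show the bytes. The newlines are gone; the records are fixed length.
printf 'ab\ncdef\n' | dd conv=block cbs=6 2>/dev/null | od -c

# Convert back: trailing blanks are trimmed and a newline is added
# to each 6-byte record (conv=unblock).
printf 'ab\ncdef\n' | dd conv=block cbs=6 2>/dev/null |
    dd conv=unblock cbs=6 2>/dev/null
```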

— TOR

21.14 Cutting Columns or Fields

A nifty command called cut lets you select a list of columns or fields from one or more files.

You must specify either the -c option to cut by column or -f to cut by fields. (Fields are separated by tabs unless you specify a different field separator with -d. Use quotes (Section 27.12) if you want a space or other special character as the delimiter.)

In some versions of cut, the column(s) or field(s) to cut must follow the option immediately, without any space. Use a comma between separate values and a hyphen to specify a range (e.g., 1-10,15 or 20,23 or 50-).

The order of the columns and fields is ignored; the characters in each line are always output from first to last, in the order they're read from the input. For example, cut -f1,2,4 produces exactly the same output as cut -f4,2,1. If this isn't what you want, try perl (Section 41.1) or awk (Section 20.10), which let you output fields in any order.

cut is incredibly handy. Here are some examples:
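As a minimal sketch, assume two invented lines in the colon-separated style of /etc/passwd (the accounts shown are made up):

```shell
# Invented sample data: seven colon-separated fields per line.
sample='root:x:0:0:Super User:/root:/bin/sh
jpeek:x:1001:100:Jerry Peek:/home/jpeek:/bin/bash'

# Fields 1 and 5 (login name and full name), with : as the delimiter.
printf '%s\n' "$sample" | cut -d: -f1,5

# Asking for -f5,1 gives the same output: cut keeps the input order.
printf '%s\n' "$sample" | cut -d: -f5,1

# Character columns 1 through 4 of each line.
printf '%s\n' "$sample" | cut -c1-4
```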

Section 21.18 covers the cut counterpart, paste.

As was mentioned, you can use awk or perl to extract columns of text. Given the above task to extract the fifth and first fields of /etc/passwd, you can use awk:

% awk -F: '{print $5, "=>", $1}' /etc/passwd

An often forgotten command-line option for perl is -a, which puts perl in awk compatibility mode. In other words, you can get the same field-splitting behavior right from the command line:

% perl -F: -lane 'print $F[4], "=>", "$F[0]"' /etc/passwd

In the line above, perl is told about the field separator in the same way awk is, with the -F flag. The next four options are fairly common. The -l option removes newlines from input and adds a newline to all print statements. This is a real space saver for "one-line wonders," like the one above. The -a flag tells perl to split each line on the indicated field separator. If no field separator is indicated, the line is split on whitespace. Each field is stored in the global array @F. Remember that the first index in a Perl array is zero. The -n option wraps the Perl code given with -e in a loop that reads one line at a time from stdin. This little Perl snippet is useful if you need to do some additional processing with the contents of each field.

—TOR, DG, and JJ

21.15 Making Text in Columns with pr

The pr command (Section 45.6) is famous for printing a file neatly on a page — with margins at top and bottom, filename, date, and page numbers. It can also print text in columns: one file per column or many columns for each file.

The -t option takes away the heading and margins at the top and bottom of each page. That's useful when "pasting" data into columns with no interruptions.

21.15.1 One File per Column: -m

The -m option reads all files on the command line simultaneously and prints each in its own column, like this:

% pr -m -t file1 file2 file3

The lines              The lines              The lines
of file1               of file2               of file3
are here               are here               are here
  ...                    ...                    ...

pr may use TAB characters between columns. If that would be bad, you can pipe pr's output through expand. Many versions of pr have a -sX option that sets the column separator to the single character X.

By default, pr -m doesn't put filenames in the heading. If you want that, use the -h option to make your own heading. Or maybe you'd like to make a more descriptive heading. Here's an example using process substitution to compare a directory with its RCS (Section 39.5) subdirectory:

% pr -m -h "working directory compared to RCS directory" <(ls) <(ls RCS)

2000-11-22 23:57  working directory compared to RCS directory  Page    1

0001.sgm                            0001.sgm,v
0002.sgm                            0002.sgm,v
0007.sgm                            0007.sgm,v
0008.sgm                            0008.sgm,v
             ...

(The heading comes from the GNU version of pr. Later examples in this article use a different version with a different heading format.)

21.15.2 One File, Several Columns: -number

An option that's a number will print a file in that number of columns. For instance, the -3 option prints a file in three columns. The file is read, line by line, until the first column is full (by default, that takes 56 lines). Next, the second column is filled. Then, the third column is filled. If there's more of the file, the first column of page 2 is filled — and the cycle repeats:

% pr -3 file1

Nov  1 19:44 1992  file1  Page 1

Line 1 here            Line 57 here           Line 113 here
Line 2 here            Line 58 here           Line 114 here
Line 3 here            Line 59 here           Line 115 here
  ...                    ...                    ...

The columns aren't balanced — if the file will fit into one column, the other columns aren't used. You can change that by adjusting -l, the page length option; see the section below.

21.15.3 Order Lines Across Columns: -l

Do you want to arrange your data across the columns, so that the first three lines print across the top of each column, the next three lines are the second in each column, and so on, like this?

% pr -l1 -t -3 file1
Line 1 here            Line 2 here            Line 3 here
Line 4 here            Line 5 here            Line 6 here
Line 7 here            Line 8 here            Line 9 here
  ...                    ...                    ...

Use the -l1 (page length 1 line) and -t (no title) options. Each "page" will be filled by three lines (or however many columns you set). You have to use -t; otherwise, pr will silently ignore any page lengths that don't leave room for the header and footer. That's just what you want if you want data in columns with no headings.
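Many versions of pr also accept -a (it's in the POSIX specification), which fills across the page directly. This is a swapped-in alternative to the -l1 trick, not the same method; the nine-line input is invented:

```shell
# Nine short lines, printed three across with no heading (-t).
# -a fills each row left to right before starting the next row.
printf '%s\n' 1 2 3 4 5 6 7 8 9 | pr -3 -a -t
```

The first row should hold 1, 2, and 3; the exact spacing between columns depends on your pr.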

If you want headings too, pipe the output of pr through another pr:

% pr -l1 -t -3 file1 | pr -h file1

Nov  1 19:48 1992  file1  Page 1

Line 1 here            Line 2 here            Line 3 here
Line 4 here            Line 5 here            Line 6 here
Line 7 here            Line 8 here            Line 9 here
  ...                    ...                    ...

The -h file1 puts the filename into the heading.

Also see paste (Section 21.18). Of course, programming languages like awk (Section 20.10) and perl (Section 41.1) can also make text into columns.

— JP

21.16 Make Columns Automatically with column

Go to http://examples.oreilly.com/upt3 for more information on: column

Another column-making program, besides cols and pr (Section 21.15), is the creatively named utility column. It tries to determine the terminal width, which you can override with the -c option (-c 132, for example, gives 132 columns: handy for printing on wide line-printer paper.) The -x option fills columns before rows — similar to pr with its -n option and cols -d.

What makes column different from the others is its -t option. This reads input data that's already in columns and rebalances the columns into a table with variable-width columns. Say what? This is easiest to see with an example, and the column(1) manual page has a good one.

If you'd like to add column headings to ls -l output, it can be a pain to try to make headings that each take the same number of characters as the data below them. For instance, the first field on each line, the permissions, takes 10 characters, but if you want to use the heading "PERM", which takes only 4 characters, you need to balance it by adding 6 spaces after. Using column -t, you can balance these automatically. Here's an example. The first command is plain ls -lo. In the second and third examples, I use sed 1d (Section 34.1) to delete the total n line from ls, and subshells (Section 24.4) to make both commands use the same standard output; this is important only in the third command, where I pipe the combined stdout to column for balancing:

; Section 28.16, > Section 28.12

$ ls -lo
total 1644
-r--r--r--    1 jpeek     1559177 Sep 19  1999 ORA_tifs.tgz
-rw-rw-r--    1 jpeek        4106 Oct 21  1999 UPT_Russian.jpg
-rw-rw-r--    1 jpeek      101944 Nov 19 09:30 london_dusk-livesights.xwd.gz
dr-xr-xr-x    2 jpeek        4096 Dec 12  1999 me
$ (echo "PERM      LINKS OWNER        SIZE MON DY TM/YR NAME"; \
> ls -lo | sed 1d)
PERM      LINKS OWNER        SIZE MON DY TM/YR NAME
-r--r--r--    1 jpeek     1559177 Sep 19  1999 ORA_tifs.tgz
-rw-rw-r--    1 jpeek        4106 Oct 21  1999 UPT_Russian.jpg
-rw-rw-r--    1 jpeek      101944 Nov 19 09:30 london_dusk-livesights.xwd.gz
dr-xr-xr-x    2 jpeek        4096 Dec 12  1999 me

$ (echo PERM LINKS OWNER SIZE MONTH DAY HH:MM/YEAR NAME; \
> ls -lo | sed 1d) | column -t
PERM        LINKS  OWNER  SIZE     MONTH  DAY  HH:MM/YEAR  NAME
-r--r--r--  1      jpeek  1559177  Sep    19   1999        ORA_tifs.tgz
-rw-rw-r--  1      jpeek  4106     Oct    21   1999        UPT_Russian.jpg
-rw-rw-r--  1      jpeek  101944   Nov    19   09:30       london_dusk-livesights.xwd.gz
dr-xr-xr-x  2      jpeek  4096     Dec    12   1999        me

My feeble attempt in the second example took a lot of trial-and-error to get the right spacing, and I still had to cram DY over the tiny sixth column and TM/YR over the seventh. In the third example, column automatically adjusted the column width to compensate for the HH:MM/YEAR heading. Unfortunately, the long filename london_dusk-livesights.xwd.gz ran off the right edge (past column 80, my window width) — but there was nothing column could do in this case because the combined header+columns were just too wide.

— JP

21.17 Straightening Jagged Columns

As we were writing this book, I decided to make a list of all the articles and the numbers of lines and characters in each, then combine that with the description, a status code, and the article's title. After a few minutes with wc -l -c (Section 16.6), cut (Section 21.14), sort (Section 22.1), and join (Section 21.19), I had a file that looked like this:

% cat messfile 
2850 2095 51441 ~BB A sed tutorial
3120 868 21259 +BB mail - lots of basics
6480 732 31034 + How to find sources - JIK's periodic posting
     ...900 lines...
5630 14 453 +JP Running Commands on Directory Stacks
1600 12 420 !JP With find, Don't Forget -print
0495 9 399 + Make 'xargs -i' use more than one filename

Yuck. It was tough to read: the columns needed to be straightened. The column (Section 21.16) command could do it automatically, but I wanted more control over the alignment of each column. A little awk (Section 20.10) script turned the mess into this:

% cat cleanfile 
2850 2095  51441 ~BB  A sed tutorial
3120  868  21259 +BB  mail - lots of basics
6480  732  31034 +    How to find sources - JIK's periodic posting
     ...900 lines...
5630   14    453 +JP  Running Commands on Directory Stacks
1600   12    420 !JP  With find, Don't Forget -print
0495    9    399 +    Make 'xargs -i' use more than one filename

Here's the simple script I used and the command I typed to run it:

% cat neatcols
{
printf "%4s %4s %6s %-4s %s\n", \
     $1, $2, $3, $4, substr($0, index($0,$5))
}
% awk -f neatcols messfile > cleanfile

You can adapt that script for whatever kinds of columns you need to clean up. In case you don't know awk, here's a quick summary:
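As a rough sketch of what the format string does, the same printf is run below on two lines taken from messfile, with comments on each piece:

```shell
# Feed two lines from messfile through the same printf as neatcols.
printf '%s\n' \
    '2850 2095 51441 ~BB A sed tutorial' \
    '5630 14 453 +JP Running Commands on Directory Stacks' |
awk '{
    # %4s and %6s right-justify a field in 4 or 6 columns;
    # %-4s left-justifies in 4 columns; %s prints the rest as is.
    # substr($0, index($0, $5)) is the line from field 5 to the end,
    # found by searching for the first place the text of $5 appears.
    printf "%4s %4s %6s %-4s %s\n", $1, $2, $3, $4, substr($0, index($0, $5))
}'
```

One thing to watch when adapting it: index finds the first occurrence of the text of $5 anywhere in the line, so a fifth field whose text also appears earlier in the line would start the title too soon.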

— JP

21.18 Pasting Things in Columns

Go to http://examples.oreilly.com/upt3 for more information on: cut+paste

Do you ever wish you could paste two (or even three) files side by side? You can, if you have the paste program (or the public-domain implementation on the disc).

For example, to create a three-column file from files x, y, and z:

$ paste x y z > file

To make paste read standard input, use the - option, and repeat - for every column you want. For example, to make an old ls (which lists files in a single column) list files in four columns:

$ ls | paste - - - -
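You can try the same idea without an old ls. In this sketch, printf stands in for it by writing eight invented names, one per line:

```shell
# Eight one-per-line items become two rows of four, separated by tabs.
# Each - makes paste take the next input line for that column.
printf '%s\n' a b c d e f g h | paste - - - -
```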

The "standard input" option is also handy when used with cut (Section 21.14). You can cut data from one position on a line and paste it back on another.

The separate data streams being merged are separated by default with a tab, but you can change this with the -d option. Unlike the -d option to cut, you need not specify a single character; instead, you can specify a list of characters, which will be used in a circular fashion.

The characters in the list can be any regular character or the following escape sequences:

    \n    newline
    \t    tab
    \\    backslash
    \0    empty string

Use quoting (Section 27.12), if necessary, to protect characters from the shell.

There's also a -s option that lets you merge subsequent lines from one file. For example, to merge each pair of lines onto a single line:

$ paste -s -d"\t\n" list
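Here is the same command as a sketch on invented input read from standard input, so you can see the delimiter list being used in a circular fashion:

```shell
# Four lines in, two lines out: the delimiters alternate tab, newline,
# tab, so each pair of input lines is joined by a tab.
printf '%s\n' name1 value1 name2 value2 | paste -s -d '\t\n' -
```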

Let's finish with one nice place to use process substitution, if your shell has it. You can use cut to grab certain columns from certain files, then use process substitution to make "files" that paste will read. Output those "files" into columns in any order you want. For example, to paste column 1 from file1 in the first output column, and column 3 from file2 in the second output column:

paste <(cut -f1 file1) <(cut -f3 file2)

If none of the shells on your system have process substitution, you can always use a bunch of temporary files, one file per column.
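A sketch of the temporary-file approach might look like this. The two input files are invented, and mktemp -d is assumed to be available for making a scratch directory:

```shell
# Scratch input files, invented for illustration (tab-separated columns).
dir=$(mktemp -d)
printf 'a1\ta2\ta3\nb1\tb2\tb3\n' > "$dir/file1"
printf 'x1\tx2\tx3\ny1\ty2\ty3\n' > "$dir/file2"

# One temporary file per output column, then paste them side by side.
cut -f1 "$dir/file1" > "$dir/col1"
cut -f3 "$dir/file2" > "$dir/col2"
paste "$dir/col1" "$dir/col2"

# Clean up the scratch directory.
rm -rf "$dir"
```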

—TOR, DG, and JP

21.19 Joining Lines with join

If you've worked with databases, you'll probably know what to do with the Unix join command; see your online manual page. If you don't have a database (as far as you know!), you still probably have a use for join: combining or "joining" two column-format files. join searches certain columns in the files; when it finds columns that match one another, it "glues the lines together" at that column. This is easiest to show with an example.

I needed to summarize the information in thousands of email messages under the MH mail system. MH made that easy: it has one command (scan) that gave me almost all the information I wanted about each message and also let me specify the format I needed. But I also had to use wc -l (Section 16.6) to count the number of lines in each message. I ended up with two files, one with scan output and the other with wc output. One field in both lines was the message number; I used sort (Section 22.1) to sort the files on that field. I used awk '{print $1 "," $2}' to massage wc output into comma-separated fields. Then I used join to "glue" the two lines together on the message-number field. (Next I fed the file to a PC running dBASE, but that's another story.)

Here's the file that I told scan to output. The columns (message number, email address, comment, name, and date sent) are separated with commas (,):

0001,andrewe@isc.uci.edu,,Andy Ernbaum,19901219
0002,bc3170x@cornell.bitnet,,Zoe Doan,19910104
0003,zcode!postman@uunet.uu.net,,Head Honcho,19910105
   ...

Here's the file from wc and awk with the message number and number of lines:

0001,11
0002,5
0003,187
   ...

The following join command then joined the two files at their first columns (-t, tells join that the fields are comma-separated):

% join -t, scanfile wcfile

The output file looked like this:

0001,andrewe@isc.uci.edu,,Andy Ernbaum,19901219,11
0002,bc3170x@cornell.bitnet,,Zoe Doan,19910104,5
0003,zcode!postman@uunet.uu.net,,Head Honcho,19910105,187
   ...
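To try this yourself, the sketch below rebuilds the first two lines of each file in a scratch directory (mktemp -d is assumed to be available) and runs the same join:

```shell
# Recreate small versions of scanfile and wcfile; both are already
# sorted on the first field, which join requires.
dir=$(mktemp -d)
printf '%s\n' '0001,andrewe@isc.uci.edu,,Andy Ernbaum,19901219' \
              '0002,bc3170x@cornell.bitnet,,Zoe Doan,19910104' > "$dir/scanfile"
printf '%s\n' '0001,11' '0002,5' > "$dir/wcfile"

# Join on the first comma-separated field of each file.
join -t, "$dir/scanfile" "$dir/wcfile"

rm -rf "$dir"
```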

Go to http://examples.oreilly.com/upt3 for more information on: join

join can do a lot more than this simple example shows. See your online manual page. The GNU version of join is on the CD-ROM [see http://examples.oreilly.com/upt3].

— JP

21.20 What Is (or Isn't) Unique?

Go to http://examples.oreilly.com/upt3 for more information on: uniq

uniq reads a file and compares adjacent lines (which means you'll usually want to sort the file first to be sure identical lines appear next to each other). Here's what uniq can do as it watches the input lines stream by:

Be warned:

% uniq file1 file2

will not print the unique lines from both file1 and file2 to standard output. It will replace the contents of file2 with the unique lines from file1!

Three more options control how comparisons are done:

uniq is often used as a filter. See also comm (Section 11.8), sort (Section 22.1), and especially sort -u (Section 22.6).

So what can you do with all of this?

To send only one copy of each line from list (which is typically sorted) to output file list.new:

uniq list list.new

To show which names appear more than once:

sort names | uniq -d
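Here's a small sketch that puts the common options side by side. The list of names is invented; -u, -d, and -c are the standard uniq options:

```shell
# Invented list: alice appears three times, bob twice, carol once.
names='alice
bob
alice
carol
bob
alice'

printf '%s\n' "$names" | sort | uniq       # one copy of each name
printf '%s\n' "$names" | sort | uniq -d    # only names that repeat: alice, bob
printf '%s\n' "$names" | sort | uniq -u    # only names that appear once: carol
printf '%s\n' "$names" | sort | uniq -c    # each name preceded by its count
```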

To show which lines appear exactly three times, search the output of uniq -c for lines that start with spaces before the digit 3 and have a tab after. (This is the way GNU uniq -c makes its output lines, at least; some versions put a space after the count instead of a tab, so check your own uniq's output first.) In the example below, each space is marked by <space> and the TAB is marked by <tab>:

grep Section 13.1

sort names | uniq -c | grep "^<space>*3<tab>"
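If you'd rather not depend on exactly which whitespace your uniq -c prints, one alternative sketch (a different technique from the grep above) lets awk compare the count field as a number. The names are invented:

```shell
# alice appears exactly three times in this made-up list.
# awk splits on any run of blanks or tabs, so $1 is the count.
printf '%s\n' alice bob alice carol bob alice |
    sort | uniq -c | awk '$1 == 3'
```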

The lines don't have to be sorted; they simply have to be adjacent. For example, if you have a log file where the last few fields are repeated, you can have uniq "watch" those fields and tell you how many times they were repeated. Here we'll skip the first four fields and get a count of how many times the rest of each line was repeated:

$ cat log
Nov 21 17:20:19 powerd: down 2 volts
Nov 21 17:20:27 powerd: down 2 volts
Nov 21 17:21:15 powerd: down 2 volts
Nov 21 17:22:48 powerd: down 2 volts
Nov 21 18:18:02 powerd: up 3 volts
Nov 21 19:55:03 powerd: down 2 volts
Nov 21 19:58:41 powerd: down 2 volts
$ uniq -4 -c log
      4 Nov 21 17:20:19 powerd: down 2 volts
      1 Nov 21 18:18:02 powerd: up 3 volts
      2 Nov 21 19:55:03 powerd: down 2 volts
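The -4 form is the older spelling. The POSIX spelling of the same request is -f 4, which newer versions of uniq should accept. Here's a sketch that runs it on the log shown above:

```shell
# The same log lines as above, held in a variable for the demonstration.
log='Nov 21 17:20:19 powerd: down 2 volts
Nov 21 17:20:27 powerd: down 2 volts
Nov 21 17:21:15 powerd: down 2 volts
Nov 21 17:22:48 powerd: down 2 volts
Nov 21 18:18:02 powerd: up 3 volts
Nov 21 19:55:03 powerd: down 2 volts
Nov 21 19:58:41 powerd: down 2 volts'

# -f 4 skips the first four fields before comparing; -c adds the counts.
printf '%s\n' "$log" | uniq -f 4 -c
```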

—JP and DG

21.21 Rotating Text

Every now and then you come across something and say, "Gee, that might come in handy someday, but I have no idea for what." This might happen to you when you're browsing at a flea market or garage sale; if you're like us, it might happen when you're browsing through public domain software.

Go to http://examples.oreilly.com/upt3 for more information on: rot

Which brings us to the rot program. rot basically just rotates text columns and rows. For example, the first column below shows an input file. The other three columns show the same file fed through rot once, twice, and three times:

$ cat file

$ rot file

$ rot file | rot

$ rot file | rot | rot

abcde
54321
5
e
1
a
4
d
2
b
3
c
3
c
2
b
4
d
1
a
5
e
edcba
12345

Now let's compare combinations of rot and tail -r (Section 42.1):

$ cat file

$ rot file

$ rot file | tail -r

$ tail -r file | rot

abcde
54321
e
12345
1
a
d
a
2
b
c
b
3
c
b
c
4
d
a
d
5
e
54321
e

rot rotates the text 90 degrees. tail -r turns the text "upside down" (last line in becomes the first line out, and so forth).

rot can also rotate the output of banner to print down a page instead of across. By now, we hope you have an idea of what rot can do!

—JP and LM

[1]  [The combination of tbl, nroff, and col can make ASCII tables in a few quick steps. The tables aren't sexy, but they can be quite complex. They can be emailed or printed anywhere and, because they're plain text, don't require sophisticated viewing software or equipment. tbl is a powerful way to describe tables without worrying about balancing columns or wrapping text in them. And if you want nicer-looking output, you can feed the same tbl file to groff. — JP]

[2]  To figure out how many numbers to count up to, divide the total size of the file by the block size you want and add one if there's a remainder. The jot program can help here.

[3]  The output file size I want is denoted by the bs or "block size" parameter to dd. The 2>/dev/null (Section 36.16, Section 43.12) gets rid of dd's diagnostic output, which isn't useful here and takes up space.

[4]  In this case, the repeat can actually occur only 98 times, since we've already specified two arguments and the maximum number is 100.

[5]  Not really. The first file contains only nine lines (1-9); the rest contain 10. In this case, you're better off saying split -10 top_ten_list.

[6]  If so, why bother gzipping? Why not forget about both gzip and uuencode? Well, you can't. Remember that tar files are binary files to start with, even if every file in the archive is an ASCII text file. You'd need to uuencode a file before mailing it, anyway, so you'd still pay the 33 percent size penalty that uuencode incurs. Using gzip minimizes the damage.

[7]  With GNU tar, you can use tar czf - emacs | uuencode .... That's not the point of this example, though. We're just showing how to uuencode some arbitrary data.
