In this chapter, we meet Unix's most common data manipulation commands.
A filter is something we are familiar with in everyday life. For instance, in cooking a sieve filters lumps out of gravy. In electronics, a filter is used to remove unwanted signal frequencies. In Unix, a filter is the term for a command that accepts standard input and alters it in some way before sending it to the standard output. Some filters remove lines from their input, some alter the lines and some change the order of the lines.
In fact, most Unix commands are filters.
The sed command that we saw in Chapter ten?? is an example of a very powerful filter; it can be programmed to remove, alter or add lines to its input.
The more command is a filter that slows down its input so that we have time to read it. Even the wc command is a filter - albeit a rather drastic one that outputs only the number of lines, words and characters contained in the input.
The tee command is a filter that makes no changes to its input because its purpose is to duplicate it.
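For instance, tee can save a copy of some output in a file (here called now) while still letting us see it on the screen. A sketch; the date shown is, of course, illustrative:
$ date | tee now
Mon Apr  1 12:00:00 GMT 1996
$ cat now
Mon Apr  1 12:00:00 GMT 1996
$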
The cat command is another filter that makes no changes to its input; it was designed for joining files into one logical stream. Here is an example that uses two one-line files:
$ cat bread jam bread
wholemeal bread
strawberry jam
wholemeal bread
$
As usual with filters, cat's output is sent to the standard output and the input files are unchanged. We have to use output redirection with cat if we wish to join files physically. For example:
$ cat bread jam bread > sandwich
$
puts the bread file's contents before and after the jam file's contents in another file called sandwich. Of course we can use cat with a single file:
$ cat jam
strawberry jam
$
Some people use cat instead of more to display files. If the file does not fit into the xterm window, they can scroll back to see the start of the file. However, if the file has more lines than xterm can scroll back, they have to use more anyway!
Beginners often start cat, or some other Unix command, without giving it files to work on. They then find that their "commands" no longer work and they can't get a shell prompt. For example:
$ cat                # do NOT do this
date
date
ls
ls
nothing works
nothing works
-
-
What has happened is that cat, having no files to work on, is taking the input and echoing it back. The way out of this is to type control-D to indicate there is no more input for the command to process, or to type control-C to stop the command.
Filters are very useful when used with pipes: several filters can be used consecutively with the output of one filter piped to the input of the next as shown:
$ cat bread jam bread | grep 'rr' | tee copy | wc -l
1
$
A command like the above is known as a pipeline. The intermediate stages (grep and tee in the example) must be filters. The first and last stages do not have to be filters, but the first must produce output and the last must accept input. A command such as rm that normally takes no standard input and produces no standard output would not normally be used in a pipeline.
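For instance, ls can start a pipeline because it produces output, even though it reads no standard input. The count shown here depends, of course, entirely on the directory:
$ ls | wc -l
3
$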
By default, wc counts the number of lines, words and characters in its input. We can use it for counting the lines in a file:
$ wc -l bread
1 bread
$
Without the -l option, wc would have given all three counts, not just the line count. If we require two of the counts, here is how to get them:
$ wc -wl < bread
1 2
$
Notice that this time wc did not display the name of the file. This is because it was the shell that arranged for wc to receive its input redirected from bread. The name of the file was not given as an argument to wc, so wc couldn't display it. If we pass more than one file to wc:
$ wc -l bread jam
1 bread
1 jam
2 total
$
we get a total as well as the individual counts. Most Unix commands behave like wc: they operate on standard input if they are not given a filename argument. The next command we study is an exception: it is purely a filter; it does not accept filename arguments.
We used tr in Chapter Two ?? to copy its input to its output without making any changes, but we did not see what tr was intended for. In fact, tr does character transliterations. Here it changes all the vowels to punctuation:
$ tr aeiou '.,:!?'
the quick brown fox jumped
th, q?:ck br!wn f!x j?mp,d
over the lazy dog
!v,r th, l.zy d!g
^D
$
Notice the lines are in pairs: the first line of each pair typed by me and the second displayed by tr after it has done its substitutions. Also, note that there are two arguments, both strings. The first string specifies which characters will be translated; the second specifies what they will be translated to. The first character in the first string is translated to the first character in the second string, and the second and following characters correspond in the same way. We can use a shortcut to specify ranges of characters. For example:
$ tr '[a-z]' '[A-Z]'
the quick brown fox jumped
THE QUICK BROWN FOX JUMPED
over the lazy dog
OVER THE LAZY DOG
^D
$
Notice that you have to use input/output redirection for tr to work with files, as it is a pure filter. Here tr transliterates the contents of two files:
$ cat jam bread | tr '[a-z]' '[A-Z]' > opensandwich
$
The head command is a filter that gives the first few lines of a file. For example:
$ head -2 cars
The typical American male devotes more than 1,600 hours a
year to his car. He sits in it while it goes and while it
$
The argument determines the number of lines for head to display. When more than one file argument is given, head uses the file-name as a title and outputs a blank line between files:
$ head -1 bread jam
==> bread <==
wholemeal bread
==> jam <==
strawberry jam
$
Try head with the asterisk file-name generation facility to see the first lines of all files in a directory.
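Assuming the current directory held just the bread and jam files from earlier, we would see something like:
$ head -1 *
==> bread <==
wholemeal bread

==> jam <==
strawberry jam
$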
The tail filter shows the last lines of a file. For example:
$ tail -3 cars
society's time budget to traffic instead of 28 per cent.
Ivan Illich
$
A negative argument specifies the number of lines to show, counted from the end of the file. With a positive argument, tail starts displaying at the specified line number, counted from the start of the file. Although the first argument is different, this command:
$ tail +12 cars
society's time budget to traffic instead of 28 per cent.
Ivan Illich
$
has the same effect as the previous one. The Linux version of tail does not have the +lineNumber option; we have to use -n +lineNumber instead. This:
This:
$ tail -n +12 cars
society's time budget to traffic instead of 28 per cent.
Ivan Illich
$
is what we need.
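The two filters also combine well in a pipeline. For instance, to see just line 12 of cars, we can take the first twelve lines with head and keep only the last of them with tail:
$ head -12 cars | tail -1
society's time budget to traffic instead of 28 per cent.
$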
Here we use a file containing health spending figures as the input to sort:
$ sort healthspend
Austria 984
Belgium 1032
Canada 1043
Denmark 1086
France 1054
Italy 847
Japan 1173
Switzerland 1463
UK 692
USA 1372
$
Notice that the default ordering is alphabetical, using the whole line as the key. Also, remember that sort is a filter - it does not normally change its input files. By default, sort uses the space character to split lines into fields, allowing us to sort with alternative keys. Here the key is the second field on the line:
$ sort -k 2 healthspend
UK 692
USA 1372
Italy 847
Japan 1173
Canada 1043
France 1054
Austria 984
Belgium 1032
Denmark 1086
Switzerland 1463
$
But this is "wrong"! sort interprets two consecutive space characters as the end of a field followed by the end of an empty field. So, as far as sort is concerned, the second field on every line except the last is empty. This makes the countries with the shortest names come out first in our example. We can use a b modifier after the field number to allow for the leading spaces in the key:
$ sort -k 2b healthspend
Belgium 1032
Canada 1043
France 1054
Denmark 1086
Japan 1173
USA 1372
Switzerland 1463
UK 692
Italy 847
Austria 984
$
Better, but still not correct! We have sorted on the second field, but the numbers have been put into alphabetical, not numerical, order; so 692 came after 1463. We can use an n modifier after the field number to treat the key as a number. If we use an r modifier as well, the output will be in reverse order, as shown here:
$ sort -k 2nr healthspend
Switzerland 1463
USA 1372
Japan 1173
Denmark 1086
France 1054
Canada 1043
Belgium 1032
Austria 984
Italy 847
UK 692
$
The b modifier is not needed this time as the n modifier also allows for the leading spaces. Britain must either be a healthy place to live, or a country without due concern for all its citizens!
We can use sort on numbers with decimal points too, as this file of education spending figures shows:
$ sort -k 2nr eduspend
Canada 7.2
Denmark 6.9
Holland 6.6
Ireland 6.2
WestGermany 6.2
France 5.7
Sweden 5.7
USA 5.7
Spain 5.0
Japan 4.9
Portugal 4.9
Italy 4.8
UK 4.7
$
Britain's place at the bottom is the reason the country will be much, much less prosperous in twenty years' time.
We can specify more than one key, as this example shows:
$ sort -k 2b -k 1 eduspend
UK 4.7
Italy 4.8
Japan 4.9
Portugal 4.9
Spain 5.0
France 5.7
Sweden 5.7
USA 5.7
Ireland 6.2
WestGermany 6.2
Holland 6.6
Denmark 6.9
Canada 7.2
$
As before, the countries are being ranked by their education spending but, this time, the -k 1 specifies that if two countries have the same figure, they should then be sorted on the country name too. This causes Ireland to appear before WestGermany.
Keys can begin part way through a word as in the following example:
$ sort -k 1.2f eduspend
Canada 7.2
Japan 4.9
Denmark 6.9
WestGermany 6.2
UK 4.7
Holland 6.6
Portugal 4.9
Spain 5.0
France 5.7
Ireland 6.2
USA 5.7
Italy 4.8
Sweden 5.7
$
This time, the extra .2 says the sort key starts with the second letter of the field. The f modifier after the key prevents sort from distinguishing between upper and lower case letters in the key, so the UK and the USA appear in the right place.
You probably expect this to sort by the second and fourth letters of the country name:
$ sort -k 1.2f -k 1.4f eduspend
Canada 7.2
Japan 4.9
Denmark 6.9
WestGermany 6.2
UK 4.7
Holland 6.6
Portugal 4.9
Spain 5.0
France 5.7
Ireland 6.2
USA 5.7
Italy 4.8
Sweden 5.7
$
But, as you can see, it doesn't! The reason is that, by default, sort expects keys to continue to the end of the line, which means that the keys for Canada's line are 'anada 7.2' and 'ada 7.2', not 'a' and 'a' as we might have expected.
When using multiple keys, we have to specify where they end. We do it like this:
$ sort -k 1.2,1.2f -k 1.4,1.4f eduspend
Canada 7.2
Japan 4.9
Denmark 6.9
WestGermany 6.2
UK 4.7
Holland 6.6
Portugal 4.9
Spain 5.0
Ireland 6.2
France 5.7
USA 5.7
Italy 4.8
Sweden 5.7
$
And, this time, we get the expected results. The end positions are specified by adding a comma and a further key position (,1.2 and ,1.4) after the keys.
The following example illustrates multiple keys and a different field separator being used:
$ sort -t '.' -k 2 -k 1,1 eduspend
Spain 5.0
Canada 7.2
Ireland 6.2
WestGermany 6.2
Holland 6.6
France 5.7
Sweden 5.7
UK 4.7
USA 5.7
Italy 4.8
Denmark 6.9
Japan 4.9
Portugal 4.9
$
This time, the field separator is the full stop character; it is specified using sort's -t option. Therefore, the first sort key is the decimal part of the number and the second is everything before the full stop (the country's name and the whole-number part). A subtle point: we don't need to say where the first key field ends because sort's assumption that it ends at the end of the line is correct in this case.
Don't forget that none of the above altered the data in healthspend or eduspend; if we wanted the sorted data left in the file, we could use the -o option:
Alternatively we could use output redirection:
$ sort -k 2nr eduspend > eduspend.sorted
$
But that needs an extra file name.
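Worse, we must never redirect a command's output onto its own input file, because the shell empties the output file before sort has read it:
$ sort -k 2nr eduspend > eduspend    # do NOT do this - eduspend is destroyed
$
This is exactly why sort provides the -o option: sort reads all of its input before it writes any output.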
Computer users often need to extract a column of data from a file. Unix uses the cut utility for this. We can extract certain characters or fields from each line using the -c or -f option. Either option is followed by a list of character or field numbers. Here we get characters one, two, three and six from a file:
$ cut -c 1-3,6 healthspend
Ita
Ausi
Belu
Cana
Frae
Denr
Jap
USA3
Swie
UK 2
$
If we omit the last number in a range, cut goes to the end of the line, as shown here:
$ cut -c 13- healthspend
847
984
1032
1043
1054
1086
1173
1372
1463
692
$
Counting character positions is necessary when the columns are aligned with multiple spaces, like ours are.
One of the problems with these tools is that they were written at different times by various Unix users who chose their own option letters and default field separators. Unlike sort, cut uses the tab character as its default separator and -d as the option letter. In our file, the fields are separated by space characters, so we need to specify that after the -d option:
$ cut -f 1 -d ' ' healthspend
Italy
Austria
Belgium
Canada
France
Denmark
Japan
USA
Switzerland
UK
$
to get the first field from the file. As cut is a filter, we need to use redirection if we wish to do more than just see the data:
$ cut -f 1 -d ' ' healthspend > countries
$ cut -c 13- healthspend | sed 's/ //' > spending
$ cut -c 1 healthspend > initials
$
These three new files will be used in the next section.
Unix supplies the paste command to join columns of data together:
$ paste spending countries
847 Italy
984 Austria
1032 Belgium
1043 Canada
1054 France
1086 Denmark
1173 Japan
1372 USA
1463 Switzerland
692 UK
$
Notice that it too uses the tab character between columns as its default, so our second column has its left margin neatly aligned.
Again, we can choose to use another character:
$ paste -d ' ' spending countries
847 Italy
984 Austria
1032 Belgium
1043 Canada
1054 France
1086 Denmark
1173 Japan
1372 USA
1463 Switzerland
692 UK
$
It is just as easy to join three files:
$ paste -d ' :' initials countries spending
I Italy:847
A Austria:984
B Belgium:1032
C Canada:1043
F France:1054
D Denmark:1086
J Japan:1173
U USA:1372
S Switzerland:1463
U UK:692
$
If we supply more than one delimiter, paste will use them in rotation; if we don't supply enough, it will recycle them.
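For instance, if we give paste just one delimiter for our three files, it is recycled for each join; with a comma, the output should be:
$ paste -d ',' initials countries spending
I,Italy,847
A,Austria,984
B,Belgium,1032
C,Canada,1043
F,France,1054
D,Denmark,1086
J,Japan,1173
U,USA,1372
S,Switzerland,1463
U,UK,692
$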
Unix commands rarely generate unnecessary output, so we shouldn't be surprised that the file comparison tool, cmp, outputs nothing when it has no differences to report:
$ cmp healthspend healthspend
$
Silence!
To see the effect when there is something to report, we need to make an altered version of the file:
$ sed '/U/d' healthspend > healthspend.1
$ echo 'UK 693' >> healthspend.1
$
We do so by deleting the lines that contain a capital 'U' (the UK's and the USA's entries) and adding a new UK entry to the end of the altered file.
(By the way, don't worry about the echo command. We will find out about it in Chapter ??. For now, it is simply an easily written way of editing a file in this book.)
When we compare the new version with the original:
$ cmp healthspend healthspend.1
healthspend healthspend.1 differ: char 120, line 8
$
cmp tells us where the first difference occurs. We need to use another tool to actually see the differences.
The sdiff command shows two files side by side with the differences marked. The -w option specifies the screen width to use:
$ sdiff -w 60 healthspend healthspend.1
Italy 847 Italy 847
Austria 984 Austria 984
Belgium 1032 Belgium 1032
Canada 1043 Canada 1043
France 1054 France 1054
Denmark 1086 Denmark 1086
Japan 1173 Japan 1173
USA 1372 <
Switzerland 1463 Switzerland 1463
UK 692 | UK 693
$
The less-than (<) and greater-than (>) symbols act like arrows, pointing at lines that are in one file and not in the other. The vertical bar (|) indicates that a line differs from one file to the other.
The paste command joins files line by line without regard to their contents. Sometimes, we wish to merge lines that have a common field; this is known as a database join. As Unix's join facility expects files to have been sorted on the common field, we need to create two files sorted by country:
$ sort healthspend > shs
$ sort eduspend > ses
$
Then, we can join them:
$ join shs ses
Canada 1043 7.2
Denmark 1086 6.9
France 1054 5.7
Italy 847 4.8
Japan 1173 4.9
UK 692 4.7
USA 1372 5.7
$
As you see, only seven countries are in both files. Each line of output contains the key field followed by the data from the first file and the data from the second file.
What join is doing can be expressed as a diagram like this:

+---------------------+
|                     |
|     +-----------+---------+
|     |           |         |
|  1  |  Common   |    2    |
|     |   Keys    |         |
|     |           |         |
|     +-----------+---------+
|                     |
+---------------------+

Key: 1 - unmatched keys in file 1
     2 - unmatched keys in file 2

DIAGRAM OF DATABASE JOIN
The seven countries output by the previous join were the keys common to both files; they correspond to the intersection area in the diagram.
Options for join allow us to access the lines with unmatched keys in either file; they correspond to areas 1 and 2 in the diagram. The -a option gives the lines with unmatched keys in addition to the lines with matched keys. The -v option gives the lines with unmatched keys instead of the lines with matched keys.
For example, the following shows the lines with unmatched keys from the first (shs) file:
$ join -v 1 shs ses
Austria 984
Belgium 1032
Switzerland 1463
$
Similarly, this shows the unmatched key lines from the second (ses) file:
$ join -v 2 shs ses
Holland 6.6
Ireland 6.2
Portugal 4.9
Spain 5.0
Sweden 5.7
WestGermany 6.2
$
Obviously, using -v 1 and -v 2 together would show the unmatched key lines from both files.
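We would expect that to give the three leftovers from shs and the six from ses, merged in key order:
$ join -v 1 -v 2 shs ses
Austria 984
Belgium 1032
Holland 6.6
Ireland 6.2
Portugal 4.9
Spain 5.0
Sweden 5.7
Switzerland 1463
WestGermany 6.2
$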
If we use the -a option instead, this is what we get:
$ join -a 1 -a 2 shs ses
Austria 984
Belgium 1032
Canada 1043 7.2
Denmark 1086 6.9
France 1054 5.7
Holland 6.6
Ireland 6.2
Italy 847 4.8
Japan 1173 4.9
Portugal 4.9
Spain 5.0
Sweden 5.7
Switzerland 1463
UK 692 4.7
USA 1372 5.7
WestGermany 6.2
$
The -a 1 and -a 2 force join to output all the data from both input files.
To print a file on a printer, we use the lp command; this does not actually do the printing itself but sends the request to a spooler that queues it until the printer is free. In this exchange:
$ lp healthspend eduspend
request id is hbp246-29465 (2 file(s))
$
hbp246 is the name of the printer queue and 29465 is the request number. We could use those later to find out about, or cancel, the request.
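On System V systems, the companion cancel command takes the request id; the confirmation message varies from system to system:
$ cancel hbp246-29465
request "hbp246-29465" cancelled
$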
Of course, lp is able to accept standard input:
$ cat healthspend eduspend | lp
request id is hbp246-29466 (standard input)
$
This is a handy way of printing several files without starting a new page for each one.
Part of the Unix philosophy is that each tool should do just one job as well as reasonably possible. That is why none of them output titles or paginate their output - that would be a second job. In line with this, lp doesn't format its input in any way; it simply prints whatever is there. Unix has a separate command, called pr, which does all the fancy formatting anyone could ask for. If the user wants the output from a command to be formatted, all they have to do is send the output through pr, usually using a pipeline.
Bear in mind that pr is usually used as a front-end for lp. We will use it on its own in this section because sending its results to the standard output is the easiest way to see them. Here is pr's default operation on a file:
$ pr cars
Apr 3 13:34 1996 cars Page 1
The typical American male devotes more than 1,600 hours a
<ten lines deleted>
society's time budget to traffic instead of 28 per cent.
Ivan Illich
<46 lines deleted>
$
The pr command is set up to use 11 inch paper. Therefore, as the cars file only needs one page, pr produced 66 lines of output. (I removed the unimportant ones.) The first five lines are a page header with two blank lines before and after; the time in the header is when the file was last modified.
By default, pr ends each page with a footer consisting of five blank lines. Here is the usual way of paginating files before printing them:
$ pr -h 'Spending files' -l 70 -w 100 \
> aidspend eduspend healthspend | lp
request id is hbp246-29577 (standard input)
$
In this instance, pr has been given a customised header along with a special page length and width.
When used without lp, pr's -t option suppresses the header and footer and prevents pr from spewing blank lines to fill the page.
Several thin files can be displayed side by side with pr, as shown here:
$ pr -t -m -w 57 aidspend eduspend healthspend
US 0.15 Canada 7.2 Italy 847
Italy 0.20 Japan 4.9 Austria 984
Ireland 0.24 Denmark 6.9 Belgium 1032
NewZealand 0.24 WestGermany 6.2 Canada 1043
Spain 0.26 UK 4.7 France 1054
Portugal 0.28 Holland 6.6 Denmark 1086
Austria 0.29 Portugal 4.9 Japan 1173
Japan 0.29 Spain 5.0 USA 1372
Belgium 0.30 France 5.7 Switzerland 1463
UK 0.30 Ireland 6.2 UK 692
Finland 0.31 USA 5.7
Germany 0.33 Italy 4.8
Switzerland 0.36 Sweden 5.7
Australia 0.38
Luxembourg 0.40
Canada 0.42
France 0.64
Netherlands 0.76
Sweden 0.90
Denmark 1.03
Norway 1.05
$
When doing multi-column output, pr divides the available page width (in characters) evenly between the columns.
If a file is long and thin, we can format it in several columns:
$ sort -k 2nr aidspend | pr -3 -t -w 57
Norway 1.05 Australia 0.38 Japan 0.29
Denmark 1.03 Switzerland 0.36 Portugal 0.28
Sweden 0.90 Germany 0.33 Spain 0.26
Netherlands 0.76 Finland 0.31 Ireland 0.24
France 0.64 Belgium 0.30 NewZealand 0.24
Canada 0.42 UK 0.30 Italy 0.20
Luxembourg 0.40 Austria 0.29 US 0.15
$
Note that pr fills up the first column before the second. Here, Japan is six from the bottom and the US is at the bottom of the aid table.
To see Unix's filter for repeated lines, we first need to duplicate one of the lines in the healthspend file:
$ cp healthspend healthspend.2
$ echo 'France 1054' >> healthspend.2
$ sort -o healthspend.2 healthspend.2
$
Then we can see that uniq prevents France's entry from coming out twice:
$ uniq healthspend.2
Austria 984
Belgium 1032
Canada 1043
Denmark 1086
France 1054
Italy 847
Japan 1173
Switzerland 1463
UK 692
USA 1372
$
If we add another line for France to the end of the file:
$ echo 'France 1054' >> healthspend.2
$
and run uniq again:
$ uniq healthspend.2
Austria 984
Belgium 1032
Canada 1043
Denmark 1086
France 1054
Italy 847
Japan 1173
Switzerland 1463
UK 692
USA 1372
France 1054
$
we see that France now appears twice. The command's name is misleading: only consecutive repeated lines are removed.
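That is why uniq is usually run on sorted input: sorting brings duplicate lines together so that uniq can remove them. With our file:
$ sort healthspend.2 | uniq
Austria 984
Belgium 1032
Canada 1043
Denmark 1086
France 1054
Italy 847
Japan 1173
Switzerland 1463
UK 692
USA 1372
$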
The awk command can do more complex manipulations involving text and numbers than can be done with the other tools. Its instructions can be so complicated that they are called a program and have to be enclosed in quotation marks. Here is the simplest possible example:
$ awk '/^U/' aidspend
US 0.15
UK 0.30
$
Programs for awk consist of one or more patterns (or conditions), each of which can be followed by an action. The action is carried out for all input lines that match the pattern (or meet the condition). In our example, the pattern is a simple regular expression and we have omitted the action. Without an action, awk does its default action, which is to display the input line. So the program, as you can see, displays the lines beginning with 'U' from the aidspend file. Here is a program with an action:
$ awk '/^U/ { print $2, $1 }' aidspend
0.15 US
0.30 UK
$
Notice that the action has to be in braces ({}). The $1 and $2 refer to the first and second fields on the line; the example displays them in reverse order. Because individual fields are being displayed, we only get one space character between the fields, so the lines look shorter than when we displayed the whole line. Here is a program with a condition instead of a pattern:
$ awk '$2 > 0.7' aidspend
Netherlands 0.76
Sweden 0.90
Denmark 1.03
Norway 1.05
$
Again, the action has been omitted, so awk displays the countries that spend more than the United Nations' recommended minimum percentage of gross national product (GNP). Here is a program with three pattern-action pairs:
$ awk '$2 > 1.0 { print $0, "(Generous)" }
> /^UK /   { print $0, "(UK)" }
> $2 < 0.2 { print $0, "(Mean)" }' aidspend
US 0.15 (Mean)
UK 0.30 (UK)
Denmark 1.03 (Generous)
Norway 1.05 (Generous)
$
Don't forget that Unix uses its other prompt (>) when the command spills over onto the next line, as shown above; the prompts aren't part of the awk program! Also, notice how the actions have been aligned to make the command as readable as possible.
Normally, Unix commands as long and as complex as the one above would not be entered interactively - they would be stored in a file using an editor. We will see how to do this in the chapter on shell scripts. For now they will be entered the hard way. If you copy them, just be sure to miss out the prompts.
Actions without patterns or conditions are performed on all input lines. We use that here to calculate an average. The first action has no pattern, so the second field of each input line is added into the variable called sum:
$ awk '    { sum = sum + $2 }
> END { average = sum / NR
>       printf "Average is %f\n", average
> }' aidspend
Average is 0.434762
$
END is a special pattern whose action is performed at the end of the input file(s). Notice that actions can be spread over more than one line, as here. NR is a built-in variable that holds the number of lines (records) read from the input file. The printf statement will be recognised by C programmers; its first argument is a format string that shows the layout of the line to be displayed. The second and subsequent arguments are values to be slotted into the format string, as indicated by place markers, which begin with a percent sign (%).
The %f in this example is a place marker for a real number; the \n shows where a newline is needed in the output. Notice that variables are automatically initialised to zero for numbers and null for strings.
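A place marker can also carry a width. As a sketch, this left-justifies each country name in twelve positions (%-12s) so that the numbers line up:
$ awk '{ printf "%-12s %4.2f\n", $1, $2 }' aidspend
US           0.15
Italy        0.20
<eighteen lines deleted>
Norway       1.05
$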
The next example shows a conditional (if) statement being used to find the longest country name occurring in the three spending files:
$ awk '    { if (length($1) > length(longest))
>                 longest = $1 }
> END { print longest }' healthspend aidspend eduspend
Switzerland
$
length is a function that returns the number of characters in its argument. See how awk is given several files to work with in the example. As a matter of fact, the same thing could be done without the if:
$ awk 'length($1) > length(longest)
>           { longest = $1 }
> END       { print longest }' healthspend
Switzerland
$
In all the awk scripts so far, the condition and the action have been on the same line; that isn't essential. In the example, the condition is in the first line and the action is in the second. However, the two conditions have been aligned with each other, as have the actions, to aid readability.
The following example shows an if-else statement being used to compare the average spending on aid in two sets of countries:
$ awk 'BEGIN { printf "Countries beginning with letters " }
> $1 < "K" { sumA = sumA + $2
>             countA = countA + 1 }
> $1 > "K" { sumK = sumK + $2
>             countK = countK + 1 }
> END { avA = sumA / countA
>       avK = sumK / countK
>       if ( avA > avK )
>            printf "A-J"
>       else printf "K-Z"
>       print " are less mean."
> }' aidspend
Countries beginning with letters K-Z are less mean.
$
The BEGIN action is performed before any lines of data are read from the input file. In this example, it is used to display a heading before processing the input.
These examples are only a sample of what awk can do; it is so powerful that whole books have been written about it.
You will have to read the manual pages very carefully to do some of these questions.
Here are some files for you to download and use in the questions: bob, bob1, cars, people, peter, peter1.
(1) Count the words, lines and characters in the cars file.
Answer
wc cars
(2) Repeat question (1) but count only the lines.
Answer
wc -l cars
(3) Sort the contents of the people file into:
(a) first name order;
Answer
sort people
(b) surname order;
Answer
sort -k 2b people
Note the b. Without it, the Joans and Johns will occur first. This is because those lines have more than one space between the first and second names, and the extra spaces are treated as spaces leading the second field.
(c) oldest first;
Answer
sort -t ',' -k 2r people
(d) oldest last.
Answer
sort -t ',' -k 2 people
(You are not expected to change the file -- only the order in which its lines appear on the screen.)
(4) Put the first character of each line of the people file into another file.
Answer
cut -c1 people > initials
(5) Put the surnames from the people file into another file.
Answer
cut -f1 -d ',' people | cut -c7- > surnames
Notice we need two cut commands because -f and -c are mutually exclusive.
OR
awk '{print $2}' people | sed 's/,//'
This is a bit simpler, as awk uses any white space characters as a field separator instead of just one character. However, we still need something to get rid of the commas.
(6) Create a file containing initials and surnames from the files created for questions (4) and (5).
Answer
paste initials surnames > names
(7) The split command does the opposite of cat. Read its man page and use it to split the people file into files of four lines each.
Answer
split -4 people part
(8) How would you check if the files bob and bob1 were identical?
Answer
cmp bob bob1
(9) Find the differences between files peter and peter1.
Answer
sdiff peter peter1
OR (better, lazier)
sdiff peter peter1 | grep '[<|>]'
(10) How would you check if there are any common lines in peter and peter1?
Answer
sdiff peter peter1
OR (better, lazier)
sdiff peter peter1 | grep -v '[<|>]'
OR (using apropos common to find a better tool -- the comm command)
sort peter > peter.sorted
sort peter1 > peter1.sorted
comm -12 peter.sorted peter1.sorted
rm peter.sorted peter1.sorted
(11) Translate all the lower case letters in a file to upper case.
Answer
tr "[a-z]" "[A-Z]" < bob
(We must use < as tr is a pure filter.)
(12) sort's -u option seems to provide the same facility as the uniq command. Are there any circumstances when they are not equivalent?
Answer
Yes -- when the sort key is not the whole line. uniq filters out duplicated adjacent lines; sort's -u option filters out lines that have the same sort key, so two lines may be different but still share a key.
(13) How would you get a printout of one of your files?
Answer
lp filename
This may not work for you if your workplace doesn't provide you with a Unix printer.
(14) Format the people file into 2 columns, 40 characters wide, with a suitable heading, on 12 line "pages".
Answer
pr -h "suitable heading" -2 -w 80 -l 12 people
(15) One of the names in the people file is repeated. Produce another version with unique names WITHOUT using an editor.
Answer
sort people | uniq > newpeople
OR
sort -u people > newpeople
(16) Calculate the average age of the people in the people file. (When you have decided which command to use, see the examples section of its man page.)
Answer
awk -F',' ' {sum += $2} END {print sum/NR}' people
(17) Format the people file into two columns, but this time, put the names across the screen rather than down the columns. (This question requires careful reading of the man page and/or ingenuity.)
Answer
pr -a -2 -w 80 people
(using the -a [across] option)
OR
pr -t -l 1 -2 -w 80 people
(an ingenious solution -- suppresses page breaks and sets the page length to one)
OR
paste - - < people
(more creative still -- standard input lines are pasted alongside the next line of standard input, but the columns are not so tidy)
(18) Run this command:
grep '\(\<[A-Za-z][a-z]*\>\).*\<\1\>' cars
What does it do? Explain why.
Answer
It finds lines that have a word repeated somewhere in the line.
[A-Za-z][a-z]* means a string of letters, possibly with an initial capital. Enclosing it in \< and \> makes it a separate word rather than just a string inside a longer word. \1 means whatever the bracketed expression matched earlier in the line, and \<\1\> means the re-occurring string must also be a separate word.
(19) In an earlier exercise you used:
who | wc -l
to get a rough count of people logged in. Improve it so that duplicate logins are not counted.
Answer
who | sort -u -k 1,1 | wc -l
The extra stage drops lines whose first field (the login code) is repeated.