
Introduction

In previous posts we focused on manipulating human- and machine-readable text files that contained prose. Text files are also frequently used to store tabular data or database records that are partially, primarily or completely numeric. Linux and UNIX have a wide variety of commands for manipulating these kinds of files, too. One of the most powerful of these is the Awk programming language, named after its creators Alfred Aho, Peter Weinberger and Brian Kernighan. Awk is standard on Linux and UNIX, and is designed to handle many of the tasks that you might otherwise use a spreadsheet for.

[UPDATE 2014. This post has been modified to reflect changes to the source database since it was originally written. The sense of the original has not been changed. If you are using the HistoryCrawler virtual machine, skip the Installation section and go directly to Downloading a Data Set.]

Installation

We will start by using a web browser to download a data file from an online economic botany database at Kew, the Royal Botanic Gardens. Start your windowing system and open a terminal, then check to see if you have Iceweasel installed. (Iceweasel is Debian’s rebranded version of Firefox; if you are using another Linux distribution, any web browser will do.)

man iceweasel

If you don’t get a man page for iceweasel, then install and start it with the following.

sudo aptitude install iceweasel
iceweasel &

We will also be making use of a command line program called csvfix, which is not part of the standard Debian packages. Check to see if it is already installed by typing

csvfix

If you get a “command not found” message, you are going to have to compile the program from source, which requires the Debian build-essential package. Check to see if that is installed with

dpkg -s build-essential | grep Status

If you get a response like “Status: install ok installed”, then you don’t have to install it. Otherwise, use the following to install the build-essential package.

sudo aptitude update
sudo aptitude install build-essential

Now you can download and build the csvfix command with the following:

wget https://bitbucket.org/neilb/csvfix/get/version-1.5.tar.gz
tar -xzvf v*tar.gz
rm v*tar.gz
cd neilb*
make lin
sudo cp ./csvfix/bin/csvfix /usr/local/bin
cd ~
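
Once the copy has finished, running csvfix again (as in the check above) should now produce output from the program itself, typically a usage message, rather than a “command not found” error.

csvfix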

Downloading a data set

Go to the advanced search page of the Economic Botany database at Kew, http://apps.kew.org/ecbot/advancedSearch, and do a search for TDWG Region “China Southeast”. When I did this search I got 278 results, although the number may grow if new specimens are added to their collection. Click the button labelled “Download Results (CSV)”. This step might take a while. When the data file is ready, your browser will prompt you for a location to save the file. Just put it in your current working directory for now. If you would like to use the exact same data set as I did for this post, you can download a copy here.

Getting an overall sense of the data file

We begin by using some familiar tools to get a sense of what the downloaded file looks like. The file command tells us that the file has Windows-style CRLF line terminators, so we make a backup of the original, then use the tr command to create a Linux version of the file with LF line endings. The wc command tells us that the file is 305 lines in length and has more than eleven thousand words. (Since the database listed 278 results, why isn’t the file 278 or 279 lines long? This is something that we will definitely want to figure out.) We can use the head command to see that the first line is a listing of fields in our data set. These include things like “Catalogue Number”, “Artefact Name”, and “Expedition”. We save a copy of the field names in search-results-fields.csv. Finally, we can use the tail command to look at the last line in the file and see what a typical record looks like. Note that there are a number of places in this line where we see a string of commas. This tells us that those fields are blank for this particular record. When working with a structured data set, it is useful to know whether and where data may be missing. We also see that some fields contain numbers (like Catalogue Number: 69603), some contain quote-delimited strings (like Storage Name: “Bottles, boxes etc”) and some contain text that is not in quotes (like Slide: no).

file search-results.csv
mv search-results.csv search-results-bak.csv
cat search-results-bak.csv | tr -d '\r' > search-results.csv
file search-results.csv
wc search-results.csv
head -1 search-results.csv > search-results-fields.csv
cat search-results-fields.csv
tail -1 search-results.csv

Since we have a mixture of textual and numeric data in this file, we can use some of our text analysis tricks to look at things like frequently occurring words. The pipeline below removes numerals, converts characters to lowercase, changes commas into blank spaces, removes other punctuation, changes blank spaces into newlines, sorts the resulting list of words, compresses repeated instances into a single line and counts them, then sorts the final file in reverse numerical order. Given our search, we see some frequent terms that we would expect, like “hong kong”, “chinese” and “china”. We also see that the terms “medicines”, “drugs”, “veterinary”, “medical” and “pharmacies” appear quite frequently, suggesting something about the nature of these botanical samples.

cat search-results.csv | tr -d '[:digit:]' | tr '[:upper:]' '[:lower:]' | tr ',' ' ' | tr -d '[:punct:]' | tr ' ' '\n' | sort | uniq -c | sort -nr > search-results-frequencies.txt
less search-results-frequencies.txt
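
If one of these terms catches your eye, a quick grep of the original file will show the lines where it appears. For example, using ‘veterinary’, one of the frequent terms noted above:

grep -i 'veterinary' search-results.csv | less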

When we look at word frequencies, we lose the context in which terms appear. One nice thing about working with structured data is that we can pose and answer numerical or statistical questions. Before we try this, however, we need to have each line in our data file (besides the header) correspond to a single database record.

Fixing spurious line breaks and field separators

We know that the database returned 278 results for our search, and that the first line of our file is a header containing field names. So why doesn’t our file have 279 rows? Let’s hypothesize that the first field, the Catalogue Number, is supposed to be a unique identifier. We can use the cut command to pull out the first column of the data set and have a look at it. The -d option lets us set the delimiter and the -f option lets us specify a field. The first thing we notice is that there is sometimes a quotation mark or line of text in the column, rather than the catalogue number that we are expecting. We use cut again, this time piping to egrep to count the lines whose first field does not begin with a digit. Recall that the -v option for egrep returns lines that do not match the pattern. There are 27 of them. If we subtract 27 (the number of lines that don’t contain a catalogue number) from 305 (the number of lines in the file), we get 278, exactly the number of records we were expecting. It appears that some of the lines in our data file were split in two during downloading (with the split usually occurring just before a quotation mark). We can confirm this by using a text editor to have a look at the file. (If you are using HistoryCrawler or another Linux distribution with LibreOffice installed, you can also use Calc to explore the file, but Calc ignores the spurious line breaks when it imports, so you won’t see the problem we are trying to fix.)

cut -d',' -f1 search-results.csv | less
cut -d',' -f1 search-results.csv | egrep -v '^[0-9].+$' | wc -l 

Let’s make a copy of the file where we have removed these spurious line breaks. There are different approaches to a problem like this one, including solutions using awk. Here we will do the following. First we use grep to make sure that our data file doesn’t contain a special character. We’re going to use ‘@’, but it could be any single character that doesn’t occur in the file. Next we change all newlines into this special character, creating a file with one long line. Using the wc command, we confirm that this process hasn’t changed the number of bytes (i.e., characters) in our file. Next we use sed to delete all of the ‘@’ characters that appear just before a non-digit. The sed pattern is complicated by the fact that we want to match any character that isn’t a digit (which is handled by [^0-9]), that we need to make a copy of the part of the pattern that we don’t want to delete (which we do by enclosing it in escaped parentheses), and that we need to refer back to that copy (which we do with \1). This tutorial explains all of the options we are using. We can then use the wc command to confirm that the resulting file is 26 bytes shorter than the previous one. Finally, we convert all the remaining ‘@’ characters back into newlines, and confirm that the file without spurious line breaks has the right number of lines (i.e., 279 including the header). We clean up by removing the temporary files.

grep '@' search-results.csv
wc -c search-results.csv
cat search-results.csv | tr '\n' '@' > search-results-oneline.csv
wc -c search-results-oneline.csv
cat search-results-oneline.csv | sed 's/@\([^0-9]\)/\1/g' > search-results-oneline-edited.csv
wc -c search-results-oneline-edited.csv
cat search-results-oneline-edited.csv | tr '@' '\n' > search-results-edited.csv
wc search-results-edited.csv
rm search-results-oneline*
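
If the backreference syntax in the sed command above still seems opaque, here is a small self-contained illustration on a made-up string (not taken from the data set):

echo 'abc@123@xyz' | sed 's/@\([^0-9]\)/\1/g'

This prints abc@123xyz: the first ‘@’ is followed by a digit and survives, while the second is followed by a letter and is deleted, with the letter itself preserved by the \1 backreference.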

Now that we have the right number of lines in our data file, we still have a bit of a problem. The comma is used both as a field separator (hence “CSV”, which stands for “comma-separated values”) and within individual fields, where it has its usual function as a punctuation mark. If we were to use a simple pattern matching approach to find commas, we would confuse these two different usages. So we need to convert the field separators into a different special character that does not naturally occur in our data. We already know that the ‘@’ is one such character, so we might as well use that for our new field separator.
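
To see the problem concretely, here is a toy line modelled on the record we looked at with tail (a catalogue number, a quoted Storage Name containing a comma, and an unquoted field). Asking cut for the second comma-delimited field splits the quoted string in the wrong place:

echo '69603,"Bottles, boxes etc",no' | cut -d',' -f2

Instead of the whole Storage Name, this prints only "Bottles, because cut treats the comma inside the quotation marks as just another separator.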

Sometimes it is possible to use sed, vi or awk to fix a problem like this, but it is pretty easy to make mistakes. Unanticipated special cases can cause problems that are difficult to debug. Instead we will use csvfix to convert our field separators to the ‘@’ character. (See the csvfix manual for more detail about the options used here.) We also use the tail command to make a version of the fixed file that does not contain the first row of field names.

csvfix echo -smq -osep '@' search-results-edited.csv > search-results-fixed.csv
tail -n +2 search-results-fixed.csv > search-results-fixed-noheader.csv

Now we can test our hypothesis that the Catalogue Number field contains unique identifiers. We use cut in a pipeline to show that each Catalogue Number appears only once.

cut -d'@' -f1 search-results-fixed.csv | sort -n | uniq -c | less
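
Rather than scanning the whole list by eye, you can ask uniq to print only the values that occur more than once; if the command below produces no output, every Catalogue Number is unique.

cut -d'@' -f1 search-results-fixed-noheader.csv | sort -n | uniq -d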

Answering questions about the data

When we wanted to select the first column of data, we used the -f1 option for cut. It will be handy to have a list of column numbers and headings, so we make a file containing those. Now we know, for example, that if we want to see which plant parts are held for a particular specimen, we need to look at the 33rd field (column).

cat search-results-fields.csv | tr ',' '\n' | cat -n > search-results-field-nums.txt
less search-results-field-nums.txt
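
If you just want the number for a single field, you can grep this list; for example, to find the column that holds the Artefact Name:

grep -i 'artefact' search-results-field-nums.txt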

How many different kinds of plant parts are represented in this data set, and how many of each kind are there? We can use a familiar pipeline to find out, as shown in the next example. We see that wood is the most common kind of specimen (76 cases), that 47 of the records have no indication of what kind of plant matter has been collected, that 37 specimens are seeds, 32 are fruits, 30 are roots, and so on.

cut -d'@' -f33 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr
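
To answer the first half of the question, how many different kinds there are, drop the counts and pipe the distinct values to wc -l (note that the blank entries count as one ‘kind’ here):

cut -d'@' -f33 search-results-fixed-noheader.csv | sort | uniq | wc -l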

Field 6, the Artefact Name, contains a more detailed description of each specimen.

cut -d'@' -f6 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr | less

Here we see that a lot of the specimens are listed as CHINESE DRUGS of various kinds. We can modify the command above to pull out those records:

cut -d'@' -f6 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr | grep 'CHINESE DRUGS' | less

If we want to count the number of items listed as CHINESE DRUGS, we could do it with a command like this:

cut -d'@' -f6 search-results-fixed-noheader.csv | grep 'CHINESE DRUGS' | wc -l
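
grep’s -c option gives the same count without the extra wc step:

cut -d'@' -f6 search-results-fixed-noheader.csv | grep -c 'CHINESE DRUGS'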

Answering more complicated queries with Awk programs

Often we are interested in answering questions that draw on information stored across multiple fields. For example, how many of the specimens listed as CHINESE DRUGS don’t come from Hong Kong? To answer this, we will also need to draw on information from field 13, TDWG Region. This is a good opportunity to build up an example Awk program step-by-step.

First we need to pull out the fields of interest. Instead of using cut, we can do this with awk. We use the -F option to indicate our field separator, and enclose the Awk program in single quotes. The curly braces tell awk to print the thirteenth and sixth fields for every line of the input file. We put a tab between the fields to make it easier to see where one ends and the next begins.

awk -F'@' '{ print $13, "\t", $6 }' search-results-fixed-noheader.csv

We can modify the program so that it only prints lines that match a particular pattern. In this case, we are only interested in the lines where field 6 matches ‘CHINESE DRUGS’.

awk -F'@' '$6 ~ /CHINESE DRUGS/ { print $13, "\t", $6 }' search-results-fixed-noheader.csv

And of those, we only want to see the lines where field 13 does not match ‘Hong Kong’. The double ampersand means that both patterns have to be satisfied for the action to occur.

awk -F'@' '$13 !~ /Hong Kong/ && $6 ~ /CHINESE DRUGS/ { print $13, "\t", $6 }' search-results-fixed-noheader.csv
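
If you only want to know how many such records there are, you can pipe this output through wc -l, or have awk keep a running count in a variable, a technique we will use again below:

awk -F'@' '$13 !~ /Hong Kong/ && $6 ~ /CHINESE DRUGS/ { count++ } END { print count }' search-results-fixed-noheader.csv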

Here is another example using Awk. In field 34, TDWG Use, we see that some of the specimens are listed as FOOD, MATERIALS, MEDICINES, POISONS or some combination thereof. We will write a short program to count the number of each. We start by counting the instances labeled as FOOD. Note that we create a variable to hold the count. This variable is incremented by 1 for each record where field 34 matches ‘FOOD’. When the program is finished going through our file line-by-line, we print the resulting sum.

awk -F'@' '$34 ~ /FOOD/ { foodsum++ } END { print "Food ", foodsum } ' search-results-fixed-noheader.csv

We need a separate variable to keep track of the count of each of the labels we are interested in. Here is how we modify the program to count lines that match ‘MATERIALS’, too.

awk -F'@' '$34 ~ /FOOD/ { foodsum++ } $34 ~ /MATERIALS/ { matsum++ } END { print "Food ", foodsum; print "Materials " matsum } ' search-results-fixed-noheader.csv

Putting the entire program on one line like this can quickly get complicated. Instead we will use a text editor to edit and save the program in a file called count-labels. Note that we are setting the field separator inside our program now, rather than as an option when we call awk.

#! /usr/bin/awk -f
BEGIN {
     FS="@"
}
$34 ~ /FOOD/ { foodsum++ }
$34 ~ /MATERIALS/ { matsum++ }
END {
     print "Food ", foodsum;
     print "Materials ", matsum
}

Then we can change the permissions and run the program with

chmod 744 count-labels
./count-labels search-results-fixed-noheader.csv

Now we can add the other labels to the program and try running it again.

#! /usr/bin/awk -f
BEGIN {
     FS="@"
}
$34 ~ /FOOD/ { foodsum++ }
$34 ~ /MATERIALS/ { matsum++ }
$34 ~ /MEDICINES/ { medsum++ }
$34 ~ /POISONS/ { poisonsum++ }
END {
     print "Food ", foodsum
     print "Materials ", matsum
     print "Medicines ", medsum
     print "Poisons ", poisonsum
}

We see that 113 of the specimens have medicinal uses, whereas only two are listed as poisons. Try writing an Awk program to list the samples that were donated before 1920, using the Donor Date field.
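
Here is one possible sketch of a solution, with two assumptions flagged in the comments: the Donor Date field number used here ($25) is only a placeholder, and we assume the field contains a four-digit year that we can pull out and compare.

#! /usr/bin/awk -f
# This is a sketch only. The field number for Donor Date ($25 below) is a
# placeholder; look up the real column number in search-results-field-nums.txt.
# We also assume the field contains a four-digit year somewhere in its text.
BEGIN {
     FS="@"
}
{
     if (match($25, /[0-9][0-9][0-9][0-9]/)) {
          year = substr($25, RSTART, RLENGTH)
          if (year + 0 < 1920) {
               print $1, "\t", $25, "\t", $6
          }
     }
}

Save it in a file (for example, donors-before-1920), make it executable with chmod 744 as before, and run it on search-results-fixed-noheader.csv.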