Archives for the month of: September, 2013


Introduction

In the previous post we used the OCLC WorldCat Identities database to learn more about Frank N. Meyer, a botanist who made a USDA-sponsored expedition to South China, 1916-18. We requested that the server return information to us that had been marked up with XML, then extracted unique identifiers for other identities in the database that are linked to the record for Meyer. We also used a package called Graphviz to visualize the core of the network connecting Meyer to his associates. If you haven’t worked through that post, you should do so before trying this one.

A spider (or ‘crawler’ or ‘bot’) is a program that downloads a page from the Internet, saves some or all of the content, extracts links to other webpages, then retrieves and processes those in turn. Search engine companies employ vast numbers of spiders to maintain up-to-date maps of the web. Although spidering on the scale of the whole web is a difficult problem–and one that requires an elaborate infrastructure to solve–there are many cases when more limited spidering can play an important role in the research process. Here we will develop a surprisingly simple Bash script to explore and visualize a tiny region of the WorldCat Identities database.

Our algorithm in plain English

When coming up with a new program, it helps to alternate between top-down and bottom-up thinking. Working top-down, you figure out what you want to accomplish in the most basic terms, then figure out how to accomplish each of your goals, sub-goals, and so on. At the same time, you keep in mind the stuff you already know how to do. Can you combine two simpler techniques to accomplish something more complicated? That is bottom-up.

Here is a description of what we want our spider to do:

  • repeat the following a number of times
    • get a unique identifier from a TO-DO list, make a note of it, then move it to a DONE list
    • retrieve the web page for that ID and save a copy
    • pull out any linked identifiers from the web page
    • keep track of links between the current identifier and any associated identifiers so we can visualize them
    • if any of the linked identifiers are not already in the DONE list, add them to the TO-DO list
    • pause for a while

As we look at this description of the spider, it is clear that we already know how to do some of these things. We can probably use a for loop to repeat the process a number of times. We know how to retrieve an XML webpage from the WorldCat Identities database, save a copy and extract the associated identities from it. We also have a basic idea of how to graph the resulting network with Graphviz. Let’s build our spidering script one step at a time.

The main loop

In our first version of the program, we include the for loop and use comments to sketch out the rest of the structure. Use a text editor (like atom or vi) to write the following script, save it as spider-1.sh, then change permissions to 744 with chmod and try running it.

#! /bin/bash

for i in {1..10}
do

     # if TODO list is not empty then do the following

          # get first LCCN from TODO list and store a copy

          echo "Processing $i"

          # remove LCCN from TODO list

          # append LCCN to DONE list

          # retrieve XML page for LCCN and save a local copy

          # get personal name for LCCN

          # pull out LCCNs for associated ids and get personal names

          # save links between LCCNs in GRAPH file

          # if LCCNs for assoc ids are not in DONE list, add to TODO list

          # sleep 2

done
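
Once the file is saved, you can make it executable and run it with the same pattern used for every script in this lesson (this assumes you saved it as spider-1.sh in your current directory):

chmod 744 spider-1.sh
./spider-1.sh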

The sleep command will pause between downloads, so we don’t hammer the OCLC server. For now, we have commented it out, however, so our tests run quickly. We don’t need to enable it until we are actually contacting their server. Note that we use indenting to help us keep track of which blocks of commands are nested inside of other blocks.

The TODO list

We will use external files to keep track of which LCCNs we have already processed, which ones we still need to process, and which links we have discovered between the various identities in the WorldCat database. Let’s start with the list of LCCNs that we want to process. We are going to keep these in a file called spider-to-do.txt. Create this file with the command

echo "lccn-n83-126466" > spider-to-do.txt

Make a copy of spider-1.sh called spider-2.sh and edit it so that it looks like the following.

#! /bin/bash

for i in {1..10}
do
     # if TODO list is not empty then do the following
       if [ -s spider-to-do.txt ]
       then

          # get first LCCN from TODO list and store a copy
          lccn=$(head -n1 spider-to-do.txt)

          echo "Processing $i, $lccn"

          # remove LCCN from TODO list
          sed -i '1d' spider-to-do.txt

          # append LCCN to DONE list
          echo $lccn >> spider-done.txt

          # retrieve XML page for LCCN and save a local copy

          # get personal name for LCCN

          # pull out LCCNs for associated ids and get personal names

          # save links between LCCNs in GRAPH file

          # if LCCNs for assoc ids not in DONE list, add to TODO list

          # sleep 2
     fi
done

Note that we have added the logic that tests to make sure our TODO list is not empty. This uses a primary expression that is true if the spider-to-do.txt file exists and its size is greater than zero. We have also added code to get the first LCCN in the TODO list and save a copy in a variable called lccn. Using sed and echo, we remove the LCCN from the TODO list and append it to the DONE list. Finally, note that we modified the echo statement so that it tells us which LCCN the script is currently processing. Check the permissions for spider-2.sh and try executing it. Make sure that you understand that it executes the for loop ten times, but that the if statement is only true once (since there is only one entry in spider-to-do.txt), so we only see the output of echo once.
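
If the -s test is unfamiliar, you can try it on its own before relying on it in the script. The following throwaway commands (not part of the spider) show how the test behaves with an empty file and a non-empty one:

touch test-empty.txt
if [ -s test-empty.txt ]; then echo "has content"; else echo "empty or missing"; fi
echo "something" > test-empty.txt
if [ -s test-empty.txt ]; then echo "has content"; else echo "empty or missing"; fi
rm test-empty.txt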

Retrieving a webpage

The next step is to retrieve the XML version of the WorldCat Identities page for the current LCCN and extract the personal name for the identity. Make a copy of spider-2.sh called spider-3.sh and modify it so it looks as follows.

#! /bin/bash

for i in {1..10}
do
     # if TODO list is not empty then do the following
       if [ -s spider-to-do.txt ]
       then

          # get first LCCN from TODO list and store a copy
          lccn=$(head -n1 spider-to-do.txt)

          echo "Processing $i, $lccn"

          # remove LCCN from TODO list
          sed -i '1d' spider-to-do.txt

          # append LCCN to DONE list
          echo $lccn >> spider-done.txt

          # retrieve XML page for LCCN and save a local copy
          wget "http://www.worldcat.org/identities/"${lccn}"/identity.xml" -O ${lccn}.xml

          # get personal name for LCCN
          currname=$(xmlstarlet sel -T -t -m "/Identity/nameInfo" -o "\"" -v "rawName/suba" -o "\"" -n ${lccn}.xml | tr -d ' ')

          echo "Current name $currname"

          # pull out LCCNs for associated ids and get personal names

          # save links between LCCNs in GRAPH file

          # if LCCNs for assoc ids not in DONE list, add to TODO list

          # sleep 2
     fi
done

As in the previous post, we use wget to retrieve the file and xmlstarlet to extract information from it. We also use the echo command to display the personal name of the LCCN we are processing.

Before we try running this version of our spider, it will be handy to have a small script to reset our spider so we can run it again. Use a text editor to enter the following script and save it as reset-spider.sh. Change the permissions to 744 and execute it, then execute spider-3.sh. Note that the reset script will notify you that some files don’t exist. That’s OK, as they will exist eventually.

#! /bin/bash

echo "lccn-n83-126466" > spider-to-do.txt
rm spider-done.txt
rm spider-graph*
rm lccn*xml

You should now have a file called lccn-n83-126466.xml which was downloaded from the WorldCat Identities database. Your spider-to-do.txt file should be empty, and your spider-done.txt file should contain the LCCN you started with. You can try resetting the spider and running it again. You should get the same results, minus a few warning messages from the reset script.
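
If you want to confirm this, a few quick commands will show the spider’s current state; cat on an empty file simply prints nothing:

cat spider-done.txt
cat spider-to-do.txt
ls lccn*.xml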

Associated identities and personal names

Next we need to extract the associated identities for the LCCN we are processing, and get personal names for each. Make a copy of spider-3.sh called spider-4.sh and edit it so that it looks like the following. As before, we use the echo command to have a look at the variables that we are creating.

#! /bin/bash

for i in {1..10}
do
     # if TODO list is not empty then do the following
       if [ -s spider-to-do.txt ]
       then

          # get first LCCN from TODO list and store a copy
          lccn=$(head -n1 spider-to-do.txt)

          echo "Processing $i, $lccn"

          # remove LCCN from TODO list
          sed -i '1d' spider-to-do.txt

          # append LCCN to DONE list
          echo $lccn >> spider-done.txt

          # retrieve XML page for LCCN and save a local copy
          wget "http://www.worldcat.org/identities/"${lccn}"/identity.xml" -O ${lccn}.xml

          # get personal name for LCCN
          currname=$(xmlstarlet sel -T -t -m "/Identity/nameInfo" -o "\"" -v "rawName/suba" -o "\"" -n ${lccn}.xml | tr -d ' ')

          echo "Current name $currname"

          # pull out LCCNs for associated ids and get personal names
          associd=$(xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -v "normName" -n ${lccn}.xml | grep 'lccn')

          echo "Associated LCCNs"
          echo $associd

          assocname=$(xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -o "\"" -v "rawName/suba" -o "\"" -n ${lccn}.xml | tr -d ' ')

          echo "Associated names"
          echo $assocname

          # save links between LCCNs in GRAPH file

          # if LCCNs for assoc ids not in DONE list, add to TODO list

          # sleep 2
     fi
done

The final version of the spider

We have two remaining problems that we need to solve in order to get our spider up and running. First, we want to save all of the links between the various identities in a file so that we can visualize them with Graphviz. This involves looping through the names stored in assocname with a for loop, and appending each link to a file that we are going to call spider-graph.dot. The second problem is to add LCCNs to our TODO list, but only if we haven’t already DONE them. We will use an if statement and the fgrep command to test whether the spider-done.txt file already contains an LCCN, and if not, append it to spider-to-do.txt. Copy the spider-4.sh file to a version called spider-final.sh, and edit it so that it looks as follows. Note that we are now hitting the WorldCat Identities database repeatedly, so we need to uncomment the sleep command.

#! /bin/bash

for i in {1..10}
do
     # if TODO list is not empty then do the following
       if [ -s spider-to-do.txt ]
       then

          # get first LCCN from TODO list and store a copy
          lccn=$(head -n1 spider-to-do.txt)

          echo "Processing $i, $lccn"

          # remove LCCN from TODO list
          sed -i '1d' spider-to-do.txt

          # append LCCN to DONE list
          echo $lccn >> spider-done.txt

          # retrieve XML page for LCCN and save a local copy
          wget "http://www.worldcat.org/identities/"${lccn}"/identity.xml" -O ${lccn}.xml

          # get personal name for LCCN
          currname=$(xmlstarlet sel -T -t -m "/Identity/nameInfo" -o "\"" -v "rawName/suba" -o "\"" -n ${lccn}.xml | tr -d ' ')

          echo "Current name $currname"

          # pull out LCCNs for associated ids and get personal names
          associd=$(xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -v "normName" -n ${lccn}.xml | grep 'lccn')

          echo "Associated LCCNs"
          echo $associd

          assocname=$(xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -o "\"" -v "rawName/suba" -o "\"" -n ${lccn}.xml | tr -d ' ')

          echo "Associated names"
          echo $assocname

          # save links between LCCNs in GRAPH file
          for a in ${assocname[@]}
          do
               echo "  "${currname}" -> "${a}";" >> spider-graph.dot
          done

          # if LCCNs for assoc ids not in DONE list, add to TODO list
          for a in ${associd[@]}
          do
               if ! fgrep -q ${a} spider-done.txt
               then
                    echo ${a} >> spider-to-do.txt
               fi
          done

          sleep 2
     fi
done

Reset the spider, then try running the final version. When it finishes running, you should have ten XML files in your directory. Use the less command to explore them, and the spider-to-do.txt, spider-done.txt and spider-graph.dot files.
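
For example, something like the following will show how many identities have been processed so far and what the accumulated graph file looks like (the exact numbers will depend on your run):

wc -l spider-to-do.txt spider-done.txt spider-graph.dot
less spider-graph.dot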

Visualizing the network of identities

Now we can write a very small script to visualize the links between identities. Save the following as graph-spider.sh, change the permissions to 744 and execute it. Note that we are adding some formatting commands to our Graphviz file so that the nodes look a particular way. You can experiment with changing these to suit yourself.

#! /bin/bash

echo "digraph G{" > spider-graph-temp.dot
echo "  node [color=grey, style=filled];" >> spider-graph-temp.dot
echo "  node [fontname=\"Verdana\", size=\"20,20\"];" >> spider-graph-temp.dot
cat spider-graph.dot | sort | uniq >> spider-graph-temp.dot
echo "}" >> spider-graph-temp.dot

neato -Tpng -Goverlap=false spider-graph-temp.dot > spider-graph.png
display spider-graph.png &

The resulting network graph looks something like this:

[Figure: spider-graph.png]

Why store the spider’s memory in external files?

If you have some experience with programming, you may be wondering why I chose to store the TODO and DONE lists in external files, rather than in memory in the form of Bash script variables. Note that when you finish running the spider for the first time, you have ten XML files in your current directory and a bunch of stuff in your spider-to-do.txt, spider-done.txt and spider-graph.dot files. In fact, you can resume the spidering process by simply running spider-final.sh again. New XML files will be added to your current directory, and the TODO and DONE lists and GRAPH file will all be updated accordingly. If you want to restart at any point, you can always run the reset script. If you find that your spider is getting stuck exploring part of the network that is not of interest, you can also add LCCNs to the DONE list before you start the spider. Using external files to store the state of the spider makes it very easy to restart it. This would be more difficult if the spider’s state were stored in memory instead.
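
As a sketch of that last idea: appending an LCCN to spider-done.txt before a run tells the spider to treat that identity as already processed, so it will never be added to the TODO list. The identifier below is a placeholder, not a real record; substitute one you have actually seen in spider-graph.dot or spider-to-do.txt.

# placeholder LCCN -- replace with a real one you want the spider to skip
echo "lccn-n00-000000" >> spider-done.txt
./spider-final.sh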


Introduction

In the previous post, we used command line tools to manipulate and study text files that contained rows and columns of data, some numeric. These kinds of files are often known as CSV files–for “comma separated values”–even though the separators may be other punctuation characters, tabs or spaces. Putting data into CSV format is one way of structuring it, while still allowing it to be stored in a human- and machine-readable file.

Not all data lends itself to being laid out in rows and columns, however. A different strategy for representing structure is to provide markup in the form of tags that indicate how a region of text should be displayed, what it means, or some other associated metadata. If these metadata tags are stored in the same file as the text to which they refer, they need to be syntactically distinguished from their surroundings. That is to say, it should be perfectly clear to a human or machine reader which part of the file is text and which part is tag.

Here are some examples. The sentence below has two HTML (HyperText Markup Language) tags which indicate how it should be displayed in a web browser. Note the use of angle brackets and a forward slash to indicate which is the beginning tag and which is the ending one.

This is how you indicate <em>emphasis</em>, and this is how you make something <strong>stand out</strong> from its surroundings.

In XML (Extensible Markup Language), you can create tags to represent any kind of metadata you wish.

The field notes were written by <author>Frank N. Meyer</author>.

In HTML and XML, tags should be properly nested.

<outside_tag>This is <inside_tag>properly nested</inside_tag></outside_tag>
<outside_tag>This is <inside_tag>not properly nested</outside_tag></inside_tag>

[UPDATE 2014. This post was changed slightly to reflect changes in WorldCat records that were made since the original post. The sense of the lesson is unchanged.]

Installation

Since markup files are plain text, we can use Linux command line tools like cat, tr, sed, awk and vi to work with them. It can sometimes be difficult to match tags with regular expressions, however, so working with grep can be frustrating. We will install a special utility called xmlstarlet to alleviate some of these problems. Use man to see if you have xmlstarlet installed. If not, install it with

sudo aptitude install xmlstarlet

[UPDATE 2014. If you are using HistoryCrawler you will have to install xmlstarlet but not graphviz.]

Later in this post we are also going to visualize some relationships in the form of graphs, diagrams that show lines or arrows connecting points or labeled nodes. We will use the graphviz package for this. Use man to see if you have it installed. If not, install it with

sudo aptitude install graphviz

Getting an XML document and extracting elements

In previous posts (1, 2), we spent some time looking at the field notes written by Frank N. Meyer during a USDA-sponsored botanical expedition to South China, 1916-18. Here we are going to use the OCLC WorldCat Identities database to learn more about Meyer and the people with whom he was associated. OCLC, the Online Computer Library Center, is the organization that maintains WorldCat, a union catalog of the holdings of tens of thousands of libraries worldwide. The Identities database contains records of the 30 million persons, organizations, fictitious characters, and so on that the items in WorldCat are by or about.

Start your GUI, open a terminal and start the iceweasel (or Firefox) browser in the background. The Identities page for Frank N. Meyer is at http://www.worldcat.org/identities/lccn-n83-126466. Spend some time exploring the page so you know what is on it. Now, in a new browser tab, open the XML version of the same page at http://www.worldcat.org/identities/lccn-n83-126466/identity.xml. Spend some time comparing the two pages. You want to discover how information that is presented for human consumption in the regular webpage is encoded in human- and machine-readable tags in the XML page. Note that you should be able to expand and collapse XML tags in your web browser display. In iceweasel, you do this by clicking on the little minus signs beside a particular tag. Doing this will give you a better sense of how the tags are nested.

Now that we have a sense of the structure of the XML document, we can try extracting some of this information using command line tools. In the terminal, use wget to download a local copy of the XML file, then use the xmlstarlet el command to get a listing of the elements in the file.

wget http://www.worldcat.org/identities/lccn-n83-126466/identity.xml
xmlstarlet el identity.xml | less

Note that for associated names, we see the following pattern repeated:

Identity/associatedNames/name
Identity/associatedNames/name/normName
Identity/associatedNames/name/rawName
Identity/associatedNames/name/rawName/suba
Identity/associatedNames/name/rawName/subb
Identity/associatedNames/name/rawName/subd

Each of these lines represents a ‘path’ to a particular element in the XML document. Looking at the XML display in iceweasel we can see how associated names are tagged. We see that the normName field contains the LCCN. This is the Library of Congress Control Number, a unique identifier. Frank N. Meyer’s LCCN is n83126466. The human-readable name is stored in the rawName/suba field, with optional information in rawName/subb. Dates are in rawName/subd.

Selecting information from an XML document

We can pull information out of an XML file using the xmlstarlet sel command. For example, if we wanted to count the number of associated names, we would type the following. The -t option introduces a template describing the output we want; the -v (value) option tells it what we are looking for.

xmlstarlet sel -t -v "count(/Identity/associatedNames/name)" -n identity.xml

Here we are more interested in using xmlstarlet to parse the XML file, that is, to find information and extract it. As a first attempt, we try matching (-m) the associated names and pulling out the values (-v) of the normName fields. The -n option puts newlines where we want them.

xmlstarlet sel -t -m "/Identity/associatedNames/name" -v "normName" -n identity.xml

The output that we get looks like the following.

lccn-n79008243
lccn-n83126465
lccn-n80145310
nc-united states$government printing office
lccn-n85140459
lccn-n85335475
lccn-n50015525
np-kelsey, harlan p
lccn-n79076676
lccn-n85800990

While we are at it, we can also grab information from the rawName fields. We modify our command to do that, outputting the results in a colon-separated table. The -T option says we want to output plain text. The -o option provides our output separators. Note that we are also including escaped quotation marks around our name fields. This will help us later when we further manipulate the information we are extracting.

xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -v "normName"  -o ":\"" -v "rawName/suba" -o " " -v "rawName/subb" -o "\"" -n identity.xml

Our output now looks like this:

lccn-n79008243:"United States Dept. of Agriculture"
lccn-n83126465:"Cunningham, Isabel Shipley "
lccn-n80145310:"United States Bureau of Plant Industry"
...
lccn-n85800990:"Smith, Erwin F. "

The blank spaces in the rawName fields will cause us problems later, so we are going to use tr to eliminate those. We will use grep to get rid of the entries that don’t have proper LCCNs. Finally, we will package everything up into a convenient Bash shell script. Use a text editor to create the following file, and name it get-assoc-names.sh.

#! /bin/bash

xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -v "normName"  -o ":\"" -v "rawName/suba" -o " " -v "rawName/subb" -o "\"" -n $1 | grep 'lccn' | tr -d ' '

Now you can change the permissions and try executing the script as follows. The last command shows how you can use cut to pull out just the names.

chmod 744 get-assoc-names.sh
./get-assoc-names.sh identity.xml
./get-assoc-names.sh identity.xml | cut -d':' -f2
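
The element listing earlier showed that dates, when present, live in rawName/subd. If you want those as well, a small variation on the same xmlstarlet pattern will print each associated name followed by its dates. This is just a sketch using the element paths we saw above; names without a subd field will simply print with nothing after them.

xmlstarlet sel -T -t -m "/Identity/associatedNames/name" -o "\"" -v "rawName/suba" -o "\" " -v "rawName/subd" -n identity.xml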

We can also write a small script to pull out the LCCN and rawName for the identity that the file is about (in this case, Frank N. Meyer). Look at the XML display in your browser again. In this case, we have to use the ‘@’ character to specify the value for a tag attribute. Use a text editor to write the following script, save it as get-name.sh, change the file permissions and try executing it.

#! /bin/bash

xmlstarlet sel -T -t -v "/Identity/pnkey" -o ":\"" -v "/Identity/nameInfo[@type='personal']/rawName/suba" -o "\"" -n $1 | tr -d ' '

Plotting a directed graph

If we want to visualize the relationships between a set of entities, one way is to create a graphical figure that shows the entities as dots or nodes, and the relationships as lines or arrows. In mathematics, this kind of figure is known as a graph (not to be confused with the other sense of the word, which usually refers to the plot of a function’s output). If the connection between two entities is directional (an arrow, rather than a line), the graph is called a digraph, or directed graph. Suppose that John Doe has some kind of relationship to Jane Doe: he might be her son, nephew, husband, uncle, Facebook friend, whatever. If we want to visualize this relationship with the Graphviz software package, we start by creating a file that looks like the following. Use a text editor to create the file and save it as example.dot.

digraph G {
     "JohnDoe" -> "JaneDoe";
}

Next we use the Graphviz neato command to convert the description of the digraph into a picture of it, and save the output as a .PNG graphics file. Finally we use the display command to show the picture (and put the process in the background using an ampersand).

neato -Tpng -Goverlap=false example.dot > example.png
display example.png &

The resulting image looks like this:

[Figure: example.png]

In order to lay out the graph, neato uses what is called a ‘spring model’. Imagine all of the nodes of the graph are weights, and all of the arrows connecting them are compression springs that are trying to push the weights apart. By simulating this process, neato arrives at a figure where the nodes are separated enough to read them, but not so far apart as to waste space.

Now suppose we want to graphically represent the relationship between the main identity in our XML file (i.e., Frank N. Meyer) and all of the identities that he is associated with. We can use a Bash script to build the digraph file automatically from the identity.xml file. We will do this in stages. We start by using the echo command to print out all of the lines of our file. Note we have to use the escape character to include one set of quotation marks inside of another. Use a text editor to create the following file and name it build-digraph.sh

#! /bin/bash

echo "digraph G {"
echo "   \"John Doe\" -> \"Jane Doe\";"
echo "}"

Change the permissions and try executing your shell script with the following commands.

chmod 744 build-digraph.sh
./build-digraph.sh

Instead of echoing the statements inside our digraph, however, we want to construct them using the xmlstarlet bash scripts that we just made. First we input the identity file on the command line and grab the name from it. Use a text editor to edit build-digraph.sh so it now looks as follows.

#! /bin/bash

NAME=$(./get-name.sh $1 | cut -d':' -f2)

echo "digraph G {"
echo "   "${NAME}" -> foobar;"
echo "}"

Try running it with

./build-digraph.sh identity.xml

Now we want to create one line in our digraph file for each associated name. This is clearly a job for the for loop. Use a text editor to edit build-digraph.sh so it now looks as follows.

#! /bin/bash

NAME=$(./get-name.sh $1 | cut -d':' -f2)

echo "digraph G {"
for ANAME in $(./get-assoc-names.sh $1 | cut -d':' -f2)
do
     echo "  "${NAME}" -> "${ANAME}";"
done
echo "}"

To see if it works as expected, try running it with

./build-digraph.sh identity.xml

Your output should look like this:

digraph G {
     "Meyer,FrankNicholas" -> "UnitedStatesDepartmentofAgriculture";
     "Meyer,FrankNicholas" -> "Cunningham,IsabelShipley"; 
     "Meyer,FrankNicholas" -> "UnitedStatesBureauofPlantIndustry";
     "Meyer,FrankNicholas" -> "UnitedStatesOfficeofForeignSeedandPlantIntroduction";
     "Meyer,FrankNicholas" -> "Fairchild,David";
     "Meyer,FrankNicholas" -> "Wilson,ErnestHenry";
     "Meyer,FrankNicholas" -> "PopulationReferenceBureau";
     "Meyer,FrankNicholas" -> "Smith,ErwinF.";
}

That looks good, so we send the output to a .dot file, run it through Graphviz neato and display.

./build-digraph.sh identity.xml > anames.dot
neato -Tpng -Goverlap=false anames.dot > anames.png
display anames.png &

If all went well, the output should look like this:

[Figure: anames.png]

Trees are graphs, too

Recall that XML files are structured so that tags are properly nested inside one another. We can visualize this containment relationship as a digraph, where we have arrows from outside_tag to inside_tag. In the next section we will use xmlstarlet to extract the structure of our XML file, then Graphviz to plot it in the form of a digraph. Instead of using neato, we will use a different Graphviz plotting routine, dot, which is more appropriate for tree-like figures. Using the -u option, we can eliminate redundant tags when we pull the elements out of an XML file with xmlstarlet el.

xmlstarlet el -u identity.xml | less

Look at the previous listing. In order to convert it into the proper form for Graphviz, we need to turn the forward slashes into arrows. We have a trickier problem, however: we need to lose everything in a given line except the last two tags and the slash between them. Think about this until you understand why that is the case. We will use grep to pull out the last two tags, separated by a slash. In order to match one or more copies of something that is not a slash, we use the following pattern

([^/]+)

So we want a string of characters that are not a slash, followed by a slash, followed by another string of characters that are not a slash, followed by the end of the line. And we want to match only that (the -o option for extended grep). The following pipeline does what we want.

xmlstarlet el -u identity.xml | grep -E -o '([^/]+)/([^/]+)$' | less

Now we use sed to replace each slash with the correct characters for Graphviz. Our XML tags contain some colons that will be confusing to Graphviz if we leave them in the tags. We are going to translate these into underscores for the purpose of making graph labels. Try the following version of the pipeline.

xmlstarlet el -u identity.xml | grep -E -o '([^/]+)/([^/]+)$' | sed 's/\//->/g' | tr ':' '_' | less

The output looks good, so we can bundle everything into a bash script called build-xml-tree.sh

#! /bin/bash

echo "digraph G {"
for LINK in $(xmlstarlet el -u $1 | grep -E -o '([^/]+)/([^/]+)$' | sed 's/\//->/g' | tr ':' '_')
do
     echo ${LINK}";"
done
echo "}"

Try running the shell script so you can make sure the output looks right.

chmod 744 build-xml-tree.sh
./build-xml-tree.sh identity.xml | less

Finally we lay out our digraph with dot. The -Grankdir option tells Graphviz that we want to use a left-right layout rather than a top-down one. This will give us a figure that is more easily compared with our web browser display.

./build-xml-tree.sh identity.xml > xmltree.dot
dot -Tpng -Goverlap=false -Grankdir=LR xmltree.dot > xmltree.png
display xmltree.png &

The resulting digraph looks like this.

[Figure: xmltree.png]

Study this figure for a few minutes. Because it encodes information about the XML file in a different way than the XML display in the web browser does, it makes it easier to see some things. For example, it is obvious when you look at the xmltree.png figure that some tags, like oclcnum or citation, may be children of more than one parent. What else can you discover about the XML file by studying the graph visualization of it?
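
One way to start answering that question is to query the element listing itself. Each line returned by the following pipeline is a distinct path ending in oclcnum, and therefore a distinct parent for that tag (a quick check along the same lines as build-xml-tree.sh):

xmlstarlet el -u identity.xml | grep 'oclcnum$'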


Introduction

In previous posts we focused on manipulating human- and machine-readable text files that contained prose. Text files are also frequently used to store tabular data or database records that are partially, primarily or completely numeric. Linux and UNIX have a wide variety of commands for manipulating these kinds of files, too. One of the most powerful of these is the Awk programming language, named after its creators Alfred Aho, Peter Weinberger and Brian Kernighan. Awk is standard on Linux and UNIX, and is designed to handle many of the tasks that you might otherwise use a spreadsheet for.

[UPDATE 2014. This post has been modified to reflect changes to the source database since it was originally written. The sense of the original has not been changed. If you are using the HistoryCrawler virtual machine, skip the Installation section and go directly to Downloading a Data Set.]

Installation

We will start by using a web browser to download a data file from an online economic botany database at Kew, the Royal Botanic Gardens. Start your windowing system and open a terminal, then check to see if you have Iceweasel installed. (That is the Debian fork of Firefox; if you are using another Linux distro you will need a web browser of some kind).

man iceweasel

If you don’t get a man page for iceweasel, then install and start it with the following.

sudo aptitude install iceweasel
iceweasel &

We will also be making use of a command line program called csvfix, which is not part of the standard Debian packages. Check to see if it is already installed by typing

csvfix

If you get a “command not found” message, you are going to have to compile the program from source, which requires the Debian build-essential package. Check to see if that is installed with

dpkg -s build-essential | grep Status

If you get a response like “Status: install ok installed”, then you don’t have to install it. Otherwise, use the following to install the build-essential package.

sudo aptitude update
sudo aptitude install build-essential

Now you can download and build the csvfix command with the following:

wget https://bitbucket.org/neilb/csvfix/get/version-1.5.tar.gz
tar -xzvf v*tar.gz
rm v*tar.gz
cd neilb*
make lin
sudo cp ./csvfix/bin/csvfix /usr/local/bin
cd ~

Downloading a data set

Go to the advanced search page of the Economic Botany database at Kew, http://apps.kew.org/ecbot/advancedSearch, and do a search for TDWG Region “China Southeast”. When I did this search I got 278 results, although it is possible the number might increase if new specimens are added to their collection. Click the button labelled “Download Results (CSV)”. This step might take a while. When the data file is ready, your browser will prompt you for a location to save the file. Just put it in your current working directory for now. If you would like to use the exact same data set as I did for this post, you can download a copy here.

Getting an overall sense of the data file

We begin by using some familiar tools to get a sense of what the downloaded file looks like. The file command tells us that there are some Windows CRLF characters in the file, so we make a backup of the original, then use the tr command to create a Linux version of the file with LF characters. The wc command tells us that the file is 305 lines in length and has more than eleven thousand words. (Since the database listed 278 results, why isn’t the file 278 or 279 lines long? This is something that we will definitely want to figure out.)

We can use the head command to see that the first line is a listing of fields in our data set. These include things like “Catalogue Number”, “Artefact Name”, and “Expedition”. We save a copy of the field names in search-results-fields.csv. Finally, we can use the tail command to look at the last line in the file and see what a typical record looks like. Note that there are a number of places in this line where we see a string of commas. This tells us that those fields are blank for this particular record. When looking at a structured data set, it is useful to know if or when there may be missing data. We also see that some fields contain numbers (like Catalogue Number: 69603), some contain quote-delimited strings (like Storage Name: “Bottles, boxes etc”) and some contain text that is not in quotes (like Slide: no).

file search-results.csv
mv search-results.csv search-results-bak.csv
cat search-results-bak.csv | tr -d '\r' > search-results.csv
file search-results.csv
wc search-results.csv
head -1 search-results.csv > search-results-fields.csv
cat search-results-fields.csv
tail -1 search-results.csv

Since we have a mixture of textual and numeric data in this file, we can use some of our text analysis tricks to look at things like frequently occurring words. The pipeline below removes numerals, converts characters to lowercase, changes commas into blank spaces, removes other punctuation, changes blank spaces into newlines, sorts the resulting list of words, compresses repeated instances into a single line and counts them, then sorts the final file in reverse numerical order. Given our search, we see some frequent terms we expect, like “hong kong”, “chinese” and “china”. We also see that the terms “medicines”, “drugs”, “veterinary”, “medical” and “pharmacies” appear quite frequently, suggesting something about the nature of these botanical samples.

cat search-results.csv | tr -d [:digit:] | tr [:upper:] [:lower:] | tr ',' ' ' | tr -d [:punct:] | tr ' ' '\n' | sort | uniq -c | sort -nr > search-results-frequencies.txt
less search-results-frequencies.txt

When we look at word frequencies, we lose the context in which terms appear. One nice thing about working with structured data is that we can pose and answer numerical or statistical questions. Before we try this, however, we need to have each line in our data file (besides the header) correspond to a single database record.

Fixing spurious line breaks and field separators

We know that the database returned 278 results for our search, and that the first line of our file is a header containing field names. So why doesn’t our file have 279 rows? Let’s hypothesize that the first field, the Catalogue Number, is supposed to be a unique identifier. We can use the cut command to pull out the first column of the data set and have a look at it. The -d option lets us set the delimiter and the -f option lets us specify a field. The first thing we notice is that there is sometimes a quotation mark or line of text in the column, rather than the catalogue number that we are expecting.

We use cut again, this time to count the number of lines that don’t contain a digit string. Recall that the -v option for egrep returns lines that do not match the pattern. There are 27 of them. If we subtract 27 (the number of lines that don’t contain a catalogue number) from 305 (the number of lines in the file), we get 278, which is exactly the number of lines we were expecting. It appears as if some of the lines in our data file may have been split into two lines during downloading (with the split usually occurring just before a quotation mark). We can confirm this by using a text editor to have a look at the file. (If you are using HistoryCrawler or another Linux distribution with LibreOffice installed, you can also use Calc to explore the file, but Calc ignores the spurious line breaks when it imports, so you won’t see the problem we are trying to fix.)

cut -d',' -f1 search-results.csv | less
cut -d',' -f1 search-results.csv | egrep -v '^[0-9].+$' | wc -l 

Let’s make a copy of the file where we have removed these spurious line breaks. There are different approaches to a problem like this one, including solutions using awk. Here we will do the following. First we use grep to make sure that our data file doesn’t contain a special character. We’re going to use ‘@’, but it could be any single character that doesn’t occur in the file. Next we change all newlines into this special character, creating a file with one long line. Using the wc command, we confirm that this process hasn’t changed the number of bytes (i.e., characters) in our file.

Next we use sed to delete all of the ‘@’ characters that appear just before a non-digit. This sed pattern is complicated: we want to match any character that isn’t a digit (handled by [^0-9]), we need to make a copy of the part of the pattern that we don’t want to delete (which we do by enclosing it in escaped parentheses), and we need to refer back to that copy (which we do with \1). This tutorial explains all of the options we are using. We can then use the wc command to confirm that the resulting file is 26 bytes shorter than the previous one. Finally, we convert all the remaining ‘@’ characters back into newlines, and confirm that the file without spurious line breaks has the right number of lines (i.e., 279 including the header). We clean up by removing temporary files.

grep '@' search-results.csv
wc -c search-results.csv
cat search-results.csv | tr '\n' '@' > search-results-oneline.csv
wc -c search-results-oneline.csv
cat search-results-oneline.csv | sed 's/@\([^0-9]\)/\1/g' > search-results-oneline-edited.csv
wc -c search-results-oneline-edited.csv
cat search-results-oneline-edited.csv | tr '@' '\n' > search-results-edited.csv
wc search-results-edited.csv
rm search-results-oneline*

Now that we have the right number of lines in our data file, we still have a bit of a problem. The comma is used both as a field separator–hence “CSV”, which stands for “comma separated values”–and within individual fields, where it has its usual function as a punctuation mark. If we were to use a simple pattern matching approach to find commas, we would confuse these two different usages. So we need to convert the field separators into a different special character that does not naturally occur in our data. We already know that the ‘@’ is one such character, so we might as well use that for our new field separator.

Sometimes it is possible to use sed, vi or awk to fix a problem like this, but it is pretty easy to make mistakes. Unanticipated special cases can cause problems that are difficult to debug. Instead we will use csvfix to convert our field separators to the ‘@’ character. (See the csvfix manual for more detail about the options used here.) We also use the tail command to make a version of the fixed file that does not contain the first row of field names.

csvfix echo -smq -osep '@' search-results-edited.csv > search-results-fixed.csv
tail -n +2 search-results-fixed.csv > search-results-fixed-noheader.csv

Now we can test our hypothesis that the Catalogue Number field contains unique identifiers. We use cut in a pipeline to show that each Catalogue Number appears only once.

cut -d'@' -f1 search-results-fixed.csv | sort -n | uniq -c | less
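
Scanning those counts by eye works for a file this size, but uniq also has a -d option that prints only duplicated lines. If the following command (a small variation, not from the original pipeline) produces no output, every Catalogue Number really is unique:

cut -d'@' -f1 search-results-fixed-noheader.csv | sort | uniq -d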

Answering questions about the data

When we wanted to select the first column of data, we used the -f1 option for cut. It will be handy to have a list of column numbers and headings, so we make a file containing those. Now we know, for example, that if we want to see which plant parts are held for a particular specimen, we need to look at the 33rd field (column).

cat search-results-fields.csv | tr ',' '\n' | cat -n > search-results-field-nums.txt
less search-results-field-nums.txt

How many different kinds of plant parts are represented in this data set, and how many of each kind are there? We can use a familiar pipeline to find out, as shown in the next example. We see that wood is the most common kind of specimen (76 cases), that 47 of the records have no indication of what kind of plant matter has been collected, that 37 specimens are seeds, 32 are fruits, 30 are roots, and so on.

cut -d'@' -f33 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr

Field 6, the Artefact Name, contains a more detailed description of each specimen.

cut -d'@' -f6 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr | less

Here we see that a lot of the specimens are listed as CHINESE DRUGS of various kinds. We can modify the command above to pull out those records:

cut -d'@' -f6 search-results-fixed-noheader.csv | sort | uniq -c | sort -nr | grep 'CHINESE DRUGS' | less

If we want to count the number of items listed as CHINESE DRUGS, we could do it with a command like this:

cut -d'@' -f6 search-results-fixed-noheader.csv | grep 'CHINESE DRUGS' | wc -l

Answering more complicated queries with Awk programs

Often we are interested in answering questions that draw on information stored across multiple fields. For example, how many of the specimens listed as CHINESE DRUGS don’t come from Hong Kong? To answer this, we will also need to draw on information from field 13, TDWG Region. This is a good opportunity to build up an example Awk program step-by-step.

First we need to pull out the fields of interest. Instead of using cut, we can do this with awk. We use the -F option to indicate our field separator, and enclose the Awk program in single quotes. The curly braces tell awk to print the thirteenth and sixth fields for every line of the input file. We put a tab between the fields to make it easier to see where one ends and the next begins.

awk -F'@' '{ print $13, "\t", $6 }' search-results-fixed-noheader.csv

We can modify the program so that it only prints lines that match a particular pattern. In this case, we are only interested in the lines where field 6 matches ‘CHINESE DRUGS’.

awk -F'@' '$6 ~ /CHINESE DRUGS/ { print $13, "\t", $6 }' search-results-fixed-noheader.csv

And of those, we only want to see the lines where field 13 does not match ‘Hong Kong’. The double ampersand means that both patterns have to be satisfied for the action to occur.

awk -F'@' '$13 !~ /Hong Kong/ && $6 ~ /CHINESE DRUGS/ { print $13, "\t", $6 }' search-results-fixed-noheader.csv
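
If we only want the number of such records rather than the records themselves, we could pipe the output through wc -l, or keep a count inside Awk and print it at the end; the one-liner below is a sketch of the second approach.

awk -F'@' '$13 !~ /Hong Kong/ && $6 ~ /CHINESE DRUGS/ { count++ } END { print count }' search-results-fixed-noheader.csv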

Here is another example using Awk. In field 34, TDWG Use, we see that some of the specimens are listed as FOOD, MATERIALS, MEDICINES, POISONS or some combination thereof. We will write a short program to count the number of each. We start by counting the instances labeled as FOOD. Note that we create a variable to hold the count. This variable is incremented by 1 for each record where field 34 matches ‘FOOD’. When the program is finished going through our file line-by-line, we print the resulting sum.

awk -F'@' '$34 ~ /FOOD/ { foodsum++ } END { print "Food ", foodsum } ' search-results-fixed-noheader.csv

We need a separate variable to keep track of the count of each of the labels we are interested in. Here is how we modify the program to count lines that match ‘MATERIALS’, too.

awk -F'@' '$34 ~ /FOOD/ { foodsum++ } $34 ~ /MATERIALS/ { matsum++ } END { print "Food ", foodsum; print "Materials " matsum } ' search-results-fixed-noheader.csv

Putting the entire program on one line like this can quickly get complicated. Instead we will use a text editor to edit and save the program in a file called count-labels. Note that we are setting the field separator inside our program now, rather than as an option when we call awk.

#! /usr/bin/awk -f
BEGIN {
     FS="@"
}
$34 ~ /FOOD/ { foodsum++ }
$34 ~ /MATERIALS/ { matsum++ }
END {
     print "Food ", foodsum;
     print "Materials ", matsum
}

Then we can change the permissions and run the program with

chmod 744 count-labels
./count-labels search-results-fixed-noheader.csv

Now we can add the other labels to the program and try running it again.

#! /usr/bin/awk -f
BEGIN {
     FS="@"
}
$34 ~ /FOOD/ { foodsum++ }
$34 ~ /MATERIALS/ { matsum++ }
$34 ~ /MEDICINES/ { medsum++ }
$34 ~ /POISONS/ { poisonsum++ }
END {
     print "Food ", foodsum
     print "Materials ", matsum
     print "Medicines ", medsum
     print "Poisons ", poisonsum
}

We see that 113 of the specimens have medicinal uses, whereas only two are listed as poisons. Try writing an Awk program to list the samples that were donated before 1920, using the Donor Date field.
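
As a starting point for that exercise, the program might look something like the sketch below. The field number used here is a placeholder (look up the real Donor Date column in search-results-field-nums.txt), and the comparison assumes the field holds a plain four-digit year; check the actual format in your download and adjust the test if needed.

#! /usr/bin/awk -f
# Sketch only: 99 is a placeholder for the Donor Date field number,
# which you should look up in search-results-field-nums.txt.
BEGIN {
     FS="@"
}
$99 != "" && $99 + 0 > 0 && $99 + 0 < 1920 { print $1, $6, $99 }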