In earlier posts we used a variety of tools to locate and contextualize words and phrases in texts, including regular expressions, concordances and search engines. In every case, however, we had to have some idea of what we were looking for. This can be a problem in exploratory research, because you usually don’t know what you don’t know. Of course it is always possible to read or skim through moderate amounts of text, but that approach doesn’t scale up to massive amounts of text. In any event, our goal is to save our care and attention for the tasks that actually require it, and to use the computer for everything else. In this post we will be using named entity recognition software from the Stanford Natural Language Processing group to automatically find people, places and organizations mentioned in a text.
In this post, we are going to be working with a book-length collection of correspondence from the Internet Archive. The digital text was created with optical character recognition (OCR) rather than being entered by hand, so there will be a number of errors in it. Download the file.
The recognizer is written in Java, so we need to make sure that is installed first. Try
man java
If you don’t get a man page in return, install the Java runtime and development kits.
sudo aptitude update
sudo aptitude upgrade
sudo aptitude install default-jre default-jdk
Confirm that you have Java version 1.6 or greater with
java -version
You will also need utilities for working with zipped files. Try
man unzip
If you don’t get a man page, you will need to install the following.
sudo aptitude install zip unzip
Now you can download the named entity recognition software. The version that I installed was 3.2.0 (June 20, 2013). If a newer version is available, you will have to make slight adjustments to the following commands.
wget http://nlp.stanford.edu/software/stanford-ner-2013-06-20.zip
unzip stanford*.zip
rm stanford*.zip
mv stanford* stanford-ner
Labeling named entities
Now we can run the named entity recognizer on our text. The software goes through every word in our text and tries to figure out if it is a person, organization or location. If it thinks it is a member of one of those categories, it will append a /PERSON, /ORGANIZATION or /LOCATION tag respectively. Otherwise, it appends a /O tag. (Note that this is an uppercase letter O, not a zero). There are more than 80,000 words in our text, so this takes a few minutes.
stanford-ner/ner.sh CAT31091908_djvu.txt > CAT31091908_djvu_ner.txt
Spend some time looking at the labeled text with
less CAT31091908_djvu_ner.txt
Our next step is to create a cleaner version of the labeled file by removing the /O tags. Recall that the sed command ‘s/before/after/g’ changes all instances of the before pattern to the after pattern. The pattern that we are trying to match has a forward slash in it, so we have to precede that with a backslash to escape it. We don’t want to match the capital O in the /ORGANIZATION tag, either, so we have to be careful to specify the blank spaces in both before and after patterns. The statement that we want looks like this
sed 's/\/O / /g' < CAT31091908_djvu_ner.txt > CAT31091908_djvu_ner_clean.txt
You can use less to confirm that we removed the /O tags but left the other ones untouched.
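To see why the space in the pattern matters, here is a minimal sketch that applies the same sed statement to a single line of tagged text. The sample sentence is made up for illustration and is not taken from the book.

```shell
# Hypothetical sample line in the recognizer's word/TAG format.
sample='Mr./O Frank/PERSON Meyer/PERSON wrote/O to/O the/O Department/ORGANIZATION of/ORGANIZATION Agriculture/ORGANIZATION'

# Same substitution as in the text: "/O " (with the space) becomes " ".
cleaned=$(printf '%s\n' "$sample" | sed 's/\/O / /g')
printf '%s\n' "$cleaned"
```

The /ORGANIZATION tags survive because the O in them is followed by an R rather than a space. By the same logic, if a /O tag is the last thing on a line, with no space after it, this statement leaves it in place.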
Matching person tags
Next we will use regular expressions to create a list of persons mentioned in the text (and recognized as such by our named entity recognizer). The recognizer is by no means perfect: it will miss some people and misclassify some words as personal names. Our text is also full of spelling and formatting errors that are a result of the OCR process. Nevertheless, we will see that the results are surprisingly useful.
The regular expression that we need to match persons is going to be complicated, so we will build it up in a series of stages. We start by creating an alias for egrep
alias egrepmatch='egrep --color -f pattr'
and we create a file called pattr that contains
Meyer
Now when we run
egrepmatch CAT31091908_djvu_ner_clean.txt
we see all of the places in our text where the name Meyer appears. But we also want to match the tag itself, so we edit pattr so it contains the following and rerun our egrepmatch command.
Meyer/PERSON
That doesn’t match any name except Meyer. Let’s change pattr to use character classes to match all the other personal names. Re-run the egrepmatch command each time you change pattr.
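One pattern that fits this description uses the POSIX class [[:alpha:]] for the letters in front of the tag. It is offered as a sketch, not as the exact contents of the original pattr. The sample line and the scratch directory are assumptions for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Work in a scratch directory so your own pattr is not overwritten.
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text.
printf '%s\n' 'Mr. Frank/PERSON N./PERSON Meyer/PERSON met Henry/PERSON Hicks/PERSON' > "$demo/sample.txt"

# Any run of letters followed by the /PERSON tag.
printf '%s\n' '[[:alpha:]]*/PERSON' > "$demo/pattr"

# -o prints each match on its own line instead of the whole input line.
matches=$(grep -oE -f "$demo/pattr" "$demo/sample.txt")
printf '%s\n' "$matches"
```

On this sample the initial N. is not captured. Only the bare /PERSON tag that follows it matches.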
That is better, but it is missing the middle initial in forms like Frank N. Meyer. So we have to say we want to match either alphabetical characters or a period. Now pattr looks like the following. (The period is a special character in regular expressions, so it is escaped here with a backslash. Inside a bracket expression the escape is not strictly required, but it is harmless for our purposes.)
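A sketch of the extended pattern, again an assumption about what pattr might contain and not the original file. The sample line and scratch directory are made up for illustration.

```shell
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text.
printf '%s\n' 'Mr. Frank/PERSON N./PERSON Meyer/PERSON met Henry/PERSON Hicks/PERSON' > "$demo/sample.txt"

# Letters or periods, followed by the /PERSON tag.
# Inside [...] the backslash is taken literally, so this class also
# admits a backslash character; that does not matter for this text.
printf '%s\n' '[[:alpha:]\.]*/PERSON' > "$demo/pattr"

matches=$(grep -oE -f "$demo/pattr" "$demo/sample.txt")
printf '%s\n' "$matches"
```

With the period added to the class, the initial is now matched as N./PERSON.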
That looks pretty good. The next problem is a bit more subtle. If we stick with the pattern that we have now, it is going to treat Henry/PERSON Hicks/PERSON as two separate person names, rather than grouping them together. What we need to capture is the idea that a string of words tagged with /PERSON are clumped together to form a larger unit. First we modify pattr so that each matching term is followed either by whitespace or by an end of line character.
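One way to express "followed by whitespace or end of line" is an alternation after the tag. The pattern below is a sketch under that assumption, tried on a made-up sample line.

```shell
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text; the last name ends the line.
printf '%s\n' 'Mr. Frank/PERSON N./PERSON Meyer/PERSON met Henry/PERSON Hicks/PERSON' > "$demo/sample.txt"

# A tagged word, then either one whitespace character or the end of the line.
printf '%s\n' '[[:alpha:]\.]*/PERSON([[:space:]]|$)' > "$demo/pattr"

matches=$(grep -oE -f "$demo/pattr" "$demo/sample.txt")
printf '%s\n' "$matches"
```

Each tagged word is still a separate match at this stage, but the match now includes the space that follows it, which is what lets the next step glue neighbouring words together.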
The last step is to indicate that one or more of these words tagged with /PERSON makes up a personal name. Our final pattern is as follows.
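Wrapping the previous stage in parentheses and adding a plus sign says "one or more of these in a row". The pattern below is a sketch of that idea and may differ in detail from the original pattr. The sample line is hypothetical.

```shell
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text.
printf '%s\n' 'Mr. Frank/PERSON N./PERSON Meyer/PERSON met Henry/PERSON Hicks/PERSON' > "$demo/sample.txt"

# One or more tagged words, each followed by whitespace or end of line.
printf '%s\n' '([[:alpha:]\.]*/PERSON([[:space:]]|$))+' > "$demo/pattr"

matches=$(grep -oE -f "$demo/pattr" "$demo/sample.txt")
printf '%s\n' "$matches"
```

The five separate words now come back as two clumps, one for Frank N. Meyer and one for Henry Hicks. A clump that is followed by more text keeps its trailing space.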
Copy this file to a new file called personpattr.
Extracting a list of personal names and counting them
To make a list of persons in our text, we run egrep with the -o option, which prints only the part of each line that matches the pattern. Run the following command and then explore CAT31091908_djvu_ner_pers.txt with less. Note that it includes names like David Fairchild and E. E. Wilson, which is what we wanted.
egrep -o -f personpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_pers.txt
Next we use a familiar pipeline to create frequency counts for each person labeled in our text.
cat CAT31091908_djvu_ner_pers.txt | sed 's/\/PERSON//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_pers_freq.txt
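If the pipeline is not yet familiar, this small sketch runs the same stages on three made-up lines in place of the real file, so you can see what each stage contributes.

```shell
# Three hypothetical lines of the kind found in CAT31091908_djvu_ner_pers.txt.
freq=$(printf '%s\n' \
  'Frank/PERSON Meyer/PERSON' \
  'Reimer/PERSON' \
  'Frank/PERSON Meyer/PERSON' |
  sed 's/\/PERSON//g' |   # drop the tags, leaving plain names
  sort |                  # bring identical names together
  uniq -c |               # collapse duplicates and prefix each with its count
  sort -nr)               # largest counts first
printf '%s\n' "$freq"
```

The first sort is needed because uniq only collapses lines that are adjacent.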
When we use less to explore this frequency file, we find some things of interest. During OCR, the letter F was frequently mistaken for P (there are 20 Pairchilds and 19 Fairchilds, for example, as well as four Prank Meyers). We also see a number of words which will turn out not to be personal names on closer inspection. Brassica and Pistacia are plant genera, Iris Gardens is likely a place, N. W. of Hsing is a direction and S. S. Feng Yang Maru is the name of a vessel. But these errors are interesting, in the sense that they give us a somewhat better idea of what this text might be about. We also see different name forms for the same individual: Meyer, Frank Meyer and Frank N. Meyer, as well as OCR errors for each.
Given a set of names, even a noisy one like this, we can begin to refine and automate our process of using the information that we have to find new sources that may be of interest. The second most common name in our text is Reimer. A quick Google search for “frank meyer” reimer turns up new sources that are very relevant to the text that we have, as do searches for “frank meyer” swingle, “frank meyer” wilson, and “feng yang maru”.
Organization and location names
The recognizer also tagged words in our text that appeared to be the names of organizations or locations. These labels are not perfect either, but they are similarly useful. To extract a list of the organization names and count the frequency of each, create a file called orgpattr containing the following regular expression.
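By analogy with the person pattern, orgpattr could look something like the following. This is a sketch under that assumption, not the original file, and the sample line is made up. If your organization names contain digits or ampersands, the character class would need to be widened.

```shell
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text.
printf '%s\n' 'He wrote to the Department/ORGANIZATION of/ORGANIZATION Agriculture/ORGANIZATION in Washington/LOCATION' > "$demo/sample.txt"

# One or more words tagged /ORGANIZATION, each followed by whitespace or end of line.
printf '%s\n' '([[:alpha:]\.]*/ORGANIZATION([[:space:]]|$))+' > "$demo/orgpattr"

orgs=$(grep -oE -f "$demo/orgpattr" "$demo/sample.txt")
printf '%s\n' "$orgs"
```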
Then run these commands.
egrep -o -f orgpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_org.txt
cat CAT31091908_djvu_ner_org.txt | sed 's/\/ORGANIZATION//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_org_freq.txt
To pull out locations, create a file called locpattr containing the following regular expression. (Note that place names are often separated by commas, as in Chihli Prov., China.)
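A location pattern along these lines would allow commas as well as whitespace between tagged words. It is a sketch, not the original locpattr. The sample line is hypothetical, and it assumes the recognizer has split each comma off as its own token, so that a comma appears with a space on either side.

```shell
demo=$(mktemp -d)

# Hypothetical line of cleaned, tagged text.
printf '%s\n' 'He went from Chihli/LOCATION Prov./LOCATION , China/LOCATION , and then to Peking/LOCATION' > "$demo/sample.txt"

# One or more words tagged /LOCATION, separated by any mix of whitespace and commas.
printf '%s\n' '([[:alpha:]\.]*/LOCATION[[:space:],]*)+' > "$demo/locpattr"

locs=$(grep -oE -f "$demo/locpattr" "$demo/sample.txt")
printf '%s\n' "$locs"
```

Because the separator class is greedy, the first match also swallows the comma and space that come after China.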
Then run these commands.
egrep -o -f locpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_loc.txt
cat CAT31091908_djvu_ner_loc.txt | sed 's/\/LOCATION//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_loc_freq.txt
Use less to explore both of the new frequency files. The location file has trailing commas and spaces, which are throwing off the frequency counts. Can you figure out how to modify the pipeline and/or regular expression to fix this?
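If you want to check your answer, one possible approach is to add two more substitutions to the sed stage of the pipeline. The sketch below runs on made-up lines in place of the real file, so the input is an assumption about what the matches look like.

```shell
# Hypothetical lines of the kind found in CAT31091908_djvu_ner_loc.txt,
# some with trailing commas and spaces.
counts=$(printf '%s\n' \
  'Chihli/LOCATION Prov./LOCATION , China/LOCATION , ' \
  'Chihli/LOCATION Prov./LOCATION , China/LOCATION' \
  'Peking/LOCATION , ' |
  sed -e 's/\/LOCATION//g' \
      -e 's/ ,/,/g' \
      -e 's/[[:space:],]*$//' |
  sort | uniq -c | sort -nr)
printf '%s\n' "$counts"
```

The second expression closes the gap in front of each comma, and the third removes any run of spaces and commas at the end of the line, so that both spellings of the Chihli entry are counted together.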
Google searches for new terms like “frank meier” hupeh and “yokohama nursery company” continue to turn up relevant new sources. Later we will learn how to automate the process of using named entities to spider for new sources.