Archives for the month of: June, 2013


Introduction

In earlier posts we used a variety of tools to locate and contextualize words and phrases in texts, including regular expressions, concordances and search engines. In every case, however, we had to have some idea of what we were looking for. This can be a problem in exploratory research, because you usually don’t know what you don’t know. Of course it is always possible to read or skim through moderate amounts of text, but that approach doesn’t scale up to massive amounts of text. In any event, our goal is to save our care and attention for the tasks that actually require it, and to use the computer for everything else. In this post we will be using named entity recognition software from the Stanford Natural Language Processing group to automatically find people, places and organizations mentioned in a text.

In this post, we are going to be working with a book-length collection of correspondence from the Internet Archive. The digital text was created with optical character recognition (OCR) rather than being entered by hand, so there will be a number of errors in it. Download the file.

wget http://archive.org/download/CAT10662165MeyerSouthChinaExplorations/CAT31091908_djvu.txt

Installation

The recognizer is written in Java, so we need to make sure that it is installed first. Try

man java

If you don’t get a man page in return, install the Java runtime and development kits.

sudo aptitude update
sudo aptitude upgrade
sudo aptitude install default-jre default-jdk

Confirm that you have Java version 1.6 or greater with

java -version

You will also need utilities for working with zipped files. Try

man unzip

If you don’t get a man page, you will need to install the following.

sudo aptitude install zip unzip

Now you can download the named entity recognition software. The version that I installed was 3.2.0 (June 20, 2013). If a newer version is available, you will have to make slight adjustments to the following commands

wget http://nlp.stanford.edu/software/stanford-ner-2013-06-20.zip
unzip stanford*.zip
rm stanford*.zip
mv stanford* stanford-ner

[UPDATE 2014. If you are using the HistoryCrawler VM, Stanford NER is already installed in your home directory. You can copy it to your current working directory with the following command.]

cp -r ~/stanford-ner .

Labeling named entities

Now we can run the named entity recognizer on our text. The software goes through every word in our text and tries to figure out if it is a person, organization or location. If it thinks it is a member of one of those categories, it will append a /PERSON, /ORGANIZATION or /LOCATION tag respectively. Otherwise, it appends a /O tag. (Note that this is an uppercase letter O, not a zero). There are more than 80,000 words in our text, so this takes a few minutes.

stanford-ner/ner.sh CAT31091908_djvu.txt > CAT31091908_djvu_ner.txt
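To give a sense of what the output looks like, a tagged sentence will resemble the following (a constructed illustration rather than an actual line from our file).

Frank/PERSON N./PERSON Meyer/PERSON collected/O seeds/O near/O Peking/LOCATION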

Spend some time looking at the labeled text with

less CAT31091908_djvu_ner.txt

Our next step is to create a cleaner version of the labeled file by removing the /O tags. Recall that the sed command ‘s/before/after/g’ changes all instances of the before pattern to the after pattern. The pattern that we are trying to match has a forward slash in it, so we have to precede that with a backslash to escape it. We don’t want to match the capital O in the /ORGANIZATION tag, either, so we have to be careful to specify the blank spaces in both before and after patterns. The statement that we want looks like this

sed 's/\/O / /g' < CAT31091908_djvu_ner.txt > CAT31091908_djvu_ner_clean.txt

You can use less to confirm that we removed the /O tags but left the other ones untouched.

Matching person tags

Next we will use regular expressions to create a list of persons mentioned in the text (and recognized as such by our named entity recognizer). The recognizer is by no means perfect: it will miss some people and misclassify some words as personal names. Our text is also full of spelling and formatting errors that are a result of the OCR process. Nevertheless, we will see that the results are surprisingly useful.

The regular expression that we need to match persons is going to be complicated, so we will build it up in a series of stages. We start by creating an alias for egrep

alias egrepmatch='egrep --color -f pattr'

and we create a file called pattr that contains

Meyer

Now when we run

egrepmatch C*clean.txt

we see all of the places in our text where the name Meyer appears. But we also want to match the tag itself, so we edit pattr so it contains the following and rerun our egrepmatch command.

Meyer/PERSON

That doesn’t match any name except Meyer. Let’s change pattr to use character classes to match all the other personal names. Re-run the egrepmatch command each time you change pattr.

[[:alpha:]]*/PERSON

That is better, but it is missing the middle initial in forms like Frank N. Meyer. So we have to say we want to match either alphabetical characters or a period. Now pattr looks like the following. (Because the period is a special character in regular expressions we have to escape it with a backslash.)

([[:alpha:]]|\.)*/PERSON

That looks pretty good. The next problem is a bit more subtle. If we stick with the pattern that we have now, it is going to treat Henry/PERSON Hicks/PERSON as two separate person names, rather than grouping them together. What we need to capture is the idea that a string of words tagged with /PERSON are clumped together to form a larger unit. First we modify pattr so that each matching term is followed either by whitespace or by an end of line character.

([[:alpha:]]|\.)*/PERSON([[:space:]]|$)

The last step is to indicate that one or more of these words tagged with /PERSON makes up a personal name. Our final pattern is as follows.

(([[:alpha:]]|\.)*/PERSON([[:space:]]|$))+

Copy this file to a new file called personpattr.

Extracting a list of personal names and counting them

To make a list of persons in our text, we run egrep with the -o option, which omits everything except the matching pattern. Run the following command and then explore CAT31091908_djvu_ner_pers.txt with less. Note that it includes names like David Fairchild and E. E. Wilson, which is what we wanted.

egrep -o -f personpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_pers.txt

Next we use a familiar pipeline to create frequency counts for each person labeled in our text.

cat CAT31091908_djvu_ner_pers.txt | sed 's/\/PERSON//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_pers_freq.txt

When we use less to explore this frequency file, we find some things of interest. During OCR, the letter F was frequently mistaken for P (there are about the same number of Pairchilds and Fairchilds, for example, as well as four Prank Meyers). We also see a number of words which will turn out not to be personal names on closer inspection. Brassica and Pistacia are plant genera, Iris Gardens is likely a place, N. W. of Hsing a direction and S. S. Feng Yang Maru the name of a vessel. But these errors are interesting, in the sense that they give us a slightly better idea of what this text might be about. We also see different name forms for the same individual: Meyer, Frank Meyer and Frank N. Meyer, as well as OCR errors for each.

Given a set of names, even a noisy one like this, we can begin to refine and automate our process of using the information that we have to find new sources that may be of interest. The second most common name in our text is Reimer. A quick Google search for “frank meyer” reimer turns up new sources that are very relevant to the text that we have, as do searches for “frank meyer” swingle, “frank meyer” wilson, and “feng yang maru”.

Organization and location names

The recognizer also tagged words in our text that appeared to be the names of organizations or locations. These labels are not perfect either, but they are similarly useful. To extract a list of the organization names and count the frequency of each, create a file called orgpattr containing the following regular expression.

(([[:alnum:]]|\.)+/ORGANIZATION([[:space:]]|$))+

Then run these commands.

egrep -o -f orgpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_org.txt
cat CAT31091908_djvu_ner_org.txt | sed 's/\/ORGANIZATION//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_org_freq.txt

To pull out locations, create a file called locpattr containing the following regular expression. (Note that place names are often separated by commas, as in Chihli Prov., China.)

(([[:alnum:]]|\.)+/LOCATION[[:space:]](,[[:space:]])?)+

Then run these commands.

egrep -o -f locpattr CAT31091908_djvu_ner_clean.txt > CAT31091908_djvu_ner_loc.txt
cat CAT31091908_djvu_ner_loc.txt | sed 's/\/LOCATION//g' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_loc_freq.txt

Use less to explore both of the new frequency files. The location file has trailing commas and spaces, which are throwing off the frequency counts. Can you figure out how to modify the pipeline and/or regular expression to fix this?
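One possible fix, sketched here without any claim to being the only approach, is to add an extra sed stage that strips trailing commas and spaces before the results are sorted and counted.

egrep -o -f locpattr CAT31091908_djvu_ner_clean.txt | sed 's/\/LOCATION//g' | sed 's/[ ,]*$//' | sort | uniq -c | sort -nr > CAT31091908_djvu_ner_loc_freq.txt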

Google searches for new terms like “frank meier” hupeh and “yokohama nursery company” continue to turn up relevant new sources. Later we will learn how to automate the process of using named entities to spider for new sources.


Introduction

In previous posts we downloaded a single book from the Internet Archive, calculated word frequencies, searched through it with regular expressions, and created a permuted term index. In this post, we extend our command line methods to include automatically downloading an arbitrarily large batch of files and building a simple search engine for our collection of sources.

In order to download a batch of files from the Internet Archive, we need a search term that will work on the advanced search page of the site. For the example here, I am going to be using the search

collection:gutenberg AND subject:"Natural history -- Juvenile literature"

There is a very nice post on the Internet Archive blog explaining the process in detail for a Mac or Windows machine. Here we will do everything at the Linux command line. For our trial we will only be using a few books, but the same method works just as well for hundreds or thousands of sources.

[UPDATE 2014. Because the subject categories at the Internet Archive were changed for one of the books since this post was written, I have had to make a few minor edits below. These do not change the sense of the original lesson.]

URL Encoding and the HTTP GET Method

First, a quick review of URL encoding and the HTTP GET method. Files on a web server are stored in a nested directory structure that is similar to the Linux / UNIX filesystem. To request a file, you have to give your web browser (or a program like wget) a URL, or uniform resource locator. Think of this like the address of a file. It starts with a message telling the server what protocol you want to use to communicate (e.g., HTTP). This is followed by the name of the host (typically a domain name like archive.org), an optional port number (which we don’t need to deal with here) and then the path to the resource you want. If the resource is a file sitting in a directory, the path will look a lot like a file path in Linux. For example,

https://github.com/williamjturkel/Digital-Research-Methods/blob/master/README.md

For many sites, however, it is possible to send a custom query to the web server and receive some content as a result. One way of doing this is with the HTTP GET method. Your file path includes a query string like

?lastname=Andrews&firstname=Jane

and the server responds appropriately. (The exact query that you send will depend on the particular web server that you are contacting.)

Regardless of whether you are requesting a single file or sending an HTTP GET, there has to be a way of dealing with blank spaces, punctuation and other funky characters in the URL. This is handled by URL encoding, which converts the URL into a form which can be readily sent online.

When it is URL encoded, the query string that we are going to send to the Internet Archive

collection:gutenberg AND subject:"Natural history -- Juvenile literature"

becomes something like

collection%3Agutenberg%20AND%20subject%3A%22Natural+history+--+Juvenile+literature%22

or

collection%3Agutenberg+AND+subject%3A%22Natural+history+--+Juvenile+literature%22

We will see one way to do this URL encoding below. In the meantime, if you would like to use a browser to see which files we are going to be batch downloading, you can see the search results here. Make sure to look at the URL in your browser’s address bar.

Using cat to Build a Query String

The cat command gives us one quick way to create small files at the Linux command line. We want a file called beginquery that contains the following text

http://archive.org/advancedsearch.php?q=

To get that, we can enter the following command, type in the line of text that we want, press Return/Enter at the end of the line, then hit control-c

cat > beginquery

Now you can use

cat beginquery

to make sure the file looks like it should. If you made a mistake, you can use sed to fix it, or delete the file with rm and start again. Using the same method, create a file called endquery which contains the following, and check your work.

&fl[]=identifier&output=csv

We’ve created the beginning and end of a query string we are going to send to the Internet Archive using wget. We still need to URL encode the query itself, and then insert it into the string.

For URL encoding, we are going to use a slick method developed by Ruslan Spivak. We use the alias command to create a URL encoder with one line of Python code. (I won’t explain how this works here, but if you would like to learn more about Python programming for humanists, there are introductory lessons at the Programming Historian website.)

At the command line you can enter the following, then type alias to check your work.

alias urlencode='python -c "import sys, urllib as ul; print ul.quote_plus(sys.argv[1])"'

If you made a mistake, you can remove the alias with

unalias urlencode

and try again. If your urlencode alias is OK, you can now use it to create a query string for the IA.

urlencode 'collection:gutenberg AND subject:"Natural history -- Juvenile literature"' > querystring

We then use cat to put the three pieces of our query together, and check our work.

cat beginquery querystring endquery | tr -d '\n' | sed '$a\' > iaquery

Note that we had to remove the newlines from the individual pieces of our query, then add one newline at the end. This is what the tr and sed commands do in the pipeline above. Use cat to look at iaquery and make sure it looks OK. Mine looks like

http://archive.org/advancedsearch.php?q=collection%3Agutenberg+AND+subject%3A%22Natural+history+--+Juvenile+literature%22&fl%5B%5D=identifier&output=csv
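One quick way to confirm that the stray newlines really are gone is cat’s -A option, which makes line endings visible (each one appears as a $); the whole query should show up as a single line.

cat -A iaquery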

If you want to download sources for a number of different Internet Archive searches, you just need to create new querystring and iaquery files for each.

Downloading a Batch of Files with wget

When we use wget to send the query that we just constructed to the Internet Archive, their webserver will respond with a list of item identifiers. The -i option to wget tells it to use the query file that we just constructed, and the -O option tells it to put the output in a file called iafilelist. We then clean up that file list by deleting the first line and removing quotation marks, as follows

wget -i iaquery -O iafilelist
cat iafilelist | tr -d [:punct:] | sed '1d' > iafilelist-clean

If you use cat to look at iafilelist-clean, it should contain the following filenames

rollosexperiment24993gut
achildsprimerofn26331gut
rollosmuseum25548gut
woodlandtales23667gut
countrywalksofan23941gut
theorbispictus28299gut

(If new files in this subject category are added to the Internet Archive, or if the metadata for any of these files are changed, your list of filenames may be different. At this point, you can edit iafilelist-clean to match the list above if you wish. Then your results should match mine exactly.)

We now have a list of files that we want to download from the Internet Archive. We will run wget again to get all of these files. Read the original post on the Internet Archive blog to learn more about the wget options being used. We create a directory to hold our downloads, then download the six books. This only takes a few seconds.

mkdir download
cd download
wget -r -H -nc -nd -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ../iafilelist-clean -B 'http://archive.org/download/'

I’ve chosen to work with a small batch of texts here for convenience, but essentially the same techniques work with a huge batch, too. If you do download a lot of files at once, you will probably want to remove the -nd option, so wget puts each source in a directory of its own. It is also very important not to hose other people’s web servers. You can learn more about polite uses of wget at the Programming Historian website.
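If you do scale up, one polite option (a suggestion, not part of the original recipe) is to add a delay between requests with wget’s --wait and --random-wait options, along these lines.

wget -r -H -nc -nd -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 --wait=2 --random-wait -i ../iafilelist-clean -B 'http://archive.org/download/'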

Cleaning Up

When we downloaded the books, we ended up with a lot of metadata files and alternate copies. We will save the ones that we want and delete the rest. Look through the downloaded files with

ls -l | less

then clean up the metadata files and the other versions of the text

mkdir metadata
mv ?????.txt_meta.txt metadata/
mv ?????-0.txt_meta.txt metadata/
ls metadata

There should be six files in the download/metadata directory. Now you can get rid of the rest of the files in the download directory that we won’t be using.

rm *meta.txt
rm pg*txt
rm *-8.txt

We are left with text versions of our six books in the download directory and the metadata files for those versions in the download/metadata directory. Use less to explore the texts and their associated metadata files.

Bursting Texts into Smaller Pieces

Note that we can use grep on multiple files. In the download directory, try executing the following command.

egrep "(bees|cows)" * | less

This will give us a list of all of the lines in our six books where one or both of the search terms appears. It is not very useful, however, because we don’t have much sense of the larger context in which the term is situated, and we don’t know how relevant each instance is. Is it simply a passing mention of a bee or cow, or a passage that is about both? To answer that kind of query we will want to build a simple search engine for our collection of sources.

The first step is to burst each book into small pieces, each of which will fit on one screen of our terminal. The reason that we are doing this is because it won’t do us much good to find out, say, that the book Woodland Tales mentions bees and cows somewhere. We want to see the exact places where both are mentioned on a single page.

In Linux we can use the split command to burst a large file into a number of smaller ones. For example, the command below shows how we would split a file called filename.txt into pieces that are named filename.x.0000, filename.x.0001, and so on. The -d option tells split we want each file we create to be numbered, the -a 4 option tells it to use four digits, and the -l 20 option tells it to create files of (at most) twenty lines each. Don’t execute this command yet, however.

split -d -a 4  -l 20 filename.txt filename.x.

Instead we start by creating a directory to store the burst copies of our books. We then copy the originals into that directory.

cd ..
mkdir burstdocs
cd burstdocs
cp ../download/*.txt .
ls

The shell should respond with

23667.txt 23941.txt 24993.txt 25548.txt 26331.txt 28299-0.txt

We need to burst all of our books into pieces, not just one of them. We could type six split commands, but that would be laborious. Instead we will take advantage of the bash shell’s ability to automate repetitive tasks by using a for loop. Here is the whole command

for fileName in $(ls -1 *.txt) ; do split -d -a 4 -l 20 $fileName $fileName.x ; done

This command makes use of command substitution. It starts by executing the command

ls -1 *.txt

which creates a list of file names, one per line. The for loop then steps through this list and puts each file name into the fileName variable, one item at a time. Each time it is executed, the split command looks in fileName to figure out what file it is supposed to be processing. All of the files that split outputs are placed in the current directory, i.e., burstdocs.

We don’t want to keep our original files in this directory after bursting them, so we delete them now.

rm ?????.txt ?????-0.txt

We can use ls to see that bursting our documents has resulted in a lot of small files. We want to rename them to get rid of the ‘txt.’ that occurs in the middle of each filename, and then we want to add a .txt extension to each.

rename 's/txt.//' *
rename 's/$/.txt/' *

We can use ls to see that each filename now looks something like 28299-0.x0645.txt. We can also count the number of files in the burstdocs directory with

ls -1 | wc -l

There should be 1698 of them. Use cd to return to the directory which contains burstdocs and download.

Swish-e, the Simple Web Indexing System for Humans – Extended

To build a simple search engine, we are going to use the Swish-e package. This is not installed by default on Debian Linux, so you may need to install it yourself. Check to see with

man swish-e

If it is not installed, the shell will respond with “No manual entry for swish-e”. In this case, you can install the package with

sudo aptitude install swish-e

Try the man command again. You should now have a man page for swish-e.

The next step is to create a configuration file called swish.conf in your current directory. It should contain the following lines

IndexDir burstdocs/
IndexOnly .txt
IndexContents TXT* .txt
IndexFile ./burstdocs.index

Now we make the index with

swish-e -c swish.conf

Searching with Swish-e

If we want to search for a particular word–say ‘bees’–we can use the following command. The -f option tells swish-e which index we want to search in. The -m option says to return the ten most relevant results, and the -w option is our search keyword.

swish-e -f burstdocs.index -m 10 -w bees

The output consists of a list of files, sorted in decreasing order of relevance. We can use less to look at the first few hits and confirm that they do, in fact, have something to do with bees.

less burstdocs/24993.x0113.txt burstdocs/24993.x0118.txt burstdocs/23667.x0077.txt

When using less to look at a number of files like this, we can move back and forth with :n for next file, and :p for previous file. As always, q quits.

There are probably more than ten relevant hits in our set of books. We can see the next ten results with the option -b 11 which tells swish-e to begin with the eleventh hit.

swish-e -f burstdocs.index -m 10 -w bees -b 11

The real advantage of using a search engine comes in finding documents that are relevant to more complex queries. For example, we could search for passages that were about both bees and cows with

swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w "bees AND cows"

The -H 0 option tells swish-e not to print a header before our results, and the -x option says that we only want to see matching filenames, one per line. We see that there are four such passages.

It is a bit of a hassle to keep typing our long search command over and over, so we create an alias for that.

alias searchburst="swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w"

Now we can perform searches like

searchburst "bees AND cows"
searchburst "(bees AND cows) AND NOT clover"
searchburst "bees NEAR5 cows"

The last example returns results where ‘bees’ is within five words of ‘cows’.

A listing of relevant filenames is not very handy if we have to type each into the less command to check it out. Instead we can use the powerful xargs command to pipe our list of filenames into less.

searchburst "bees NEAR5 cows" | xargs less


Recap

In a previous post, we used wget to download a Project Gutenberg ebook from the Internet Archive, then cleaned up the file using the sed and tr commands. The code below puts all of the commands we used into a pipeline.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
cat stmtn10.txt | tr -d '\r' | sed '2206,2525d' | sed '1,40d' > stmtn10-trimmedlf.txt

Using a pipeline of commands that we have already seen, we can also create a list of words in our ebook, one per line, sorted alphabetically. In English, the command below says “send the stmtn10-trimmedlf.txt file into a pipeline that translates uppercase characters into lowercase, translates hyphens into blank spaces, translates apostrophes into blank spaces, deletes all other punctuation, puts one word per line, sorts the words alphabetically, removes all duplicates and writes the resulting wordlist to a file called stmtn10-wordlist.txt”.

cat stmtn10-trimmedlf.txt | tr [:upper:] [:lower:] | tr '-' ' ' | tr "'" " " | tr -d [:punct:] | tr ' ' '\n' | sort | uniq > stmtn10-wordlist.txt

Note that we have to use double quotes around the tr expression that contains a single quote (i.e., apostrophe), so that the shell does not get confused about the arguments we are providing. Use less to explore stmtn10-wordlist.txt.

Dictionaries

Typically when you install Linux, at least one natural language dictionary is installed. Each dictionary is simply a text file that contains an alphabetical listing of ‘words’ in the language, one per line. The dictionaries are used for programs that do spell checking, but they are also a nice resource that can be used for text mining and other tasks. You will find them in the folder /usr/share/dict. I chose American English as my language when I installed Linux, so I have a dictionary called /usr/share/dict/american-english.
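If you are curious about what your dictionary looks like, you can page through it with less or count its entries with wc (adjust the filename if you installed a different language).

less /usr/share/dict/american-english
wc -l /usr/share/dict/american-english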

Suppose you are reading through a handwritten document and you come across a word that begins with an s, has two or three characters that you can’t make out, and ends in th.  You can use grep to search for the pattern in your dictionary to get some suggestions.

grep -E "^s.{2,3}th$" /usr/share/dict/american-english

The computer responds with the following.

saith
sheath
sixth
sleuth
sloth
smith
smooth
sooth
south
swath

In the statement above, the caret (^) and dollar sign ($) stand for the beginning and end of line respectively. Since each line in the dictionary file consists of a single word, we get words back. The dot (.) stands for a single character, and the pair of numbers in curly braces ({n,m}) say we are trying to match at least n characters and at most m.

Linux actually has a family of grep commands that correspond to commonly used options. There is a command called egrep, for example, which is equivalent to using grep -E, to match an extended set of patterns. There is a command called fgrep which is a fast way to search for fixed strings (rather than patterns). We will use both egrep and fgrep in examples below. As with any Linux command, you can learn more about command line options with man.
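For example, the following pairs of commands should produce identical results (fgrep corresponds to grep -F, which treats the pattern as a fixed string rather than a regular expression).

egrep "^s.{2,3}th$" /usr/share/dict/american-english
grep -E "^s.{2,3}th$" /usr/share/dict/american-english

fgrep "smith" /usr/share/dict/american-english
grep -F "smith" /usr/share/dict/american-english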

Words in our text that aren’t in the dictionary

One way to use a dictionary for text mining is to get a list of words that appear in the text but are not listed in the dictionary. We can do this using the fgrep command, as shown below. In English, the command says “using the file /usr/share/dict/american-english as a source of strings to match (-f option), find all the words (-w option) in stmtn10-wordlist.txt that are not in the list of strings (-v option) and send the results to the file stmtn10-nondictwords.txt”.

fgrep -v -w -f /usr/share/dict/american-english stmtn10-wordlist.txt > stmtn10-nondictwords.txt

Use the less command to explore stmtn10-nondictwords.txt. Note that it contains years (1861, 1865), proper names (alice, andrews, charles), toponyms (america, boston, calcutta) and British / Canadian spellings (centre, fibres). Note that it also includes a lot of specialized vocabulary which gives us some sense of what this text may be about: coccinea (a kind of plant), coraltown, cornfield, ferny, flatheads, goshawk, hepaticas (another plant), pitchy, quercus (yet another plant), seaweeds, and so on. Two interesting ‘words’ in this list are cucuie and ea. Use grep on stmtn10-trimmedlf.txt to figure out what they are.
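As a starting point for that bit of detective work, you might try something like the commands below (the -w option in the second one keeps ea from matching inside longer words).

grep --color "cucuie" stmtn10-trimmedlf.txt
grep --color -w "ea" stmtn10-trimmedlf.txt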

Matching patterns within and across words

The grep command and its variants are useful for matching patterns both within a word and across a sequence of words. If we wanted to find all of the examples in our original text that contain an apostrophe s, we would use the command below. Note that the –color option colors the portion of the text that matches our pattern.

egrep --color "'s " stmtn10-trimmedlf.txt

If we wanted to find contractions, we could change the pattern to “‘t “, and if we wanted to match both we would use “‘. “ (this would also match abbreviations like discover’d).
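Written out in full, those two variations would look something like this.

egrep --color "'t " stmtn10-trimmedlf.txt
egrep --color "'. " stmtn10-trimmedlf.txt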

We could search for the use of particular kinds of words. Which, for example, contain three vowels in a row?

egrep --color "[aeiou]{3}" stmtn10-trimmedlf.txt

We can also use egrep to search for particular kinds of phrases. For example, we could look for use of the first person pronoun in conjunction with English modal verbs.

egrep --color "I (can|could|dare|may|might|must|need|ought|shall|should|will|would)" stmtn10-trimmedlf.txt

Or we could see which pronouns are used with the word must:

egrep --color "(I|you|he|she|it|they) must" stmtn10-trimmedlf.txt

Spend some time using egrep to search for particular kinds of words and phrases. For example, how would you find regular past tense verbs (ones that end in -ed)? Years? Questions? Quotations?
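If you get stuck, here is one possible starting point for the first two of those questions; the patterns are deliberately rough, and you will probably want to refine them.

egrep --color "[[:alpha:]]+ed " stmtn10-trimmedlf.txt
egrep --color "[0-9]{4}" stmtn10-trimmedlf.txt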

Keywords in context

If you are interested in studying how particular words are used in a text, it is usually a good idea to build a concordance. At the Linux command line, this can be done easily using the ptx command, which builds a permuted term index. The command below uses the -f option to fold lowercase to uppercase for sorting purposes, and the -w option to set the width of our output to 50 characters.

ptx -f -w 50 stmtn10-trimmedlf.txt > stmtn10-ptx.txt

The output is stored in the file stmtn10-ptx.txt, which you can explore with less or search with grep.

If we want to find the word ‘giant’, for example, we might start with the following command. The -i option tells egrep to ignore case, so we get uppercase, lowercase and mixed case results.

egrep -i "[[:alpha:]]   giant" stmtn10-ptx.txt

Note that the word ‘giant’ occurs in many of the index entries. By preceding it with any alphabetic character, followed by three blank spaces, we see only those entries where ‘giant’ is the keyword in context. (Try grepping stmtn10-ptx.txt for the pattern “giant” to see what I mean.)
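One quick way to see the difference is to count how many lines each pattern matches.

egrep -i "giant" stmtn10-ptx.txt | wc -l
egrep -i "[[:alpha:]]   giant" stmtn10-ptx.txt | wc -l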

As a more detailed example, we might try grepping through our permuted term index to see if the author uses gendered pronouns differently.  Start by creating two files of pronouns in context.

egrep -i "[[:alpha:]]   (he|him|his) " stmtn10-ptx.txt > stmtn10-male.txt
egrep -i "[[:alpha:]]   (she|her|hers) " stmtn10-ptx.txt > stmtn10-female.txt

Now you can use wc -l to count the number of lines in each file, and less to page through them. We can also search both files together for interesting patterns.  If we type in the following command

cat *male* | egrep "   (he|she) .*ed"

we find “she died” and “she needs” versus “he toiled”, “he sighed”, “he flapped”, “he worked”, “he lifted”, “he dared”, “he lived”, “he pushed”, “he wanted” and “he packed”.


Introduction

In the Linux and Unix operating systems, everything is treated as a file. Whenever possible, those files are stored as human- and machine-readable text files. As a result, Linux contains a large number of tools that are specialized for working with texts. Here we will use a few of these tools to explore a textual source.

Downloading a text

Our first task is to obtain a sample text to analyze. We will be working with a nineteenth-century book from the Internet Archive: Jane Andrews, The Stories Mother Nature Told Her Children (1888, 1894). Since this text is part of the Project Gutenberg collection, it was typed in by humans, rather than being scanned and OCRed by machine. This greatly reduces the number of textual errors we expect to find in it.  To download the file, we will use the wget command, which needs a URL. We don’t want to give the program the URL that we use to read the file in our browser, because if we do the file that we download will have HTML markup tags in it. Instead, we want the raw text file, which is located at

http://archive.org/download/thestoriesmother05792gut/stmtn10.txt

First we download the file with wget, then we use the ls command (list directory contents) to make sure that we have a local copy.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
ls

Our first view of the text

The Linux file command allows us to confirm that we have downloaded a text file. When we type

file stmtn10.txt

the computer responds with

stmtn10.txt: C source, ASCII text, with CRLF line terminators

The output of the file command confirms that this is an ASCII text (which we expect), guesses that it is some code in the C programming language (which is incorrect) and tells us that the ends of the lines in the file are coded with both a carriage return and a line feed. This is standard for Windows computers. Linux and OS X expect the ends of lines in an ASCII text file to be coded only with a line feed. If we want to move text files between operating systems, this is one thing we have to pay attention to. Later we will learn one method to convert the line endings from CRLF to LF, but for now we can leave the file as it is.

[UPDATE 2014. The file command no longer mistakenly identifies the file as C code.]

The head and tail commands show us the first few and last few lines of the file respectively.

head stmtn10.txt
The Project Gutenberg EBook of The Stories Mother Nature Told Her Children
by Jane Andrews

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.
tail stmtn10.txt

[Portions of this eBook's header and trailer may be reprinted only
when distributed free of all fees.  Copyright (C) 2001, 2002 by
Michael S. Hart.  Project Gutenberg is a TradeMark and may not be
used in any sales of Project Gutenberg eBooks or other materials be
they hardware or software or any other related product without
express permission.]

*END THE SMALL PRINT! FOR PUBLIC DOMAIN EBOOKS*Ver.02/11/02*END*

As we can see, the Project Gutenberg text includes some material in the header and footer which we will probably want to remove so we can analyze the source itself. Before modifying files, it is usually a good idea to make a copy of the original. We can do this with the cp command, then use the ls command to make sure we now have two copies of the file.

cp stmtn10.txt stmtn10-backup.txt
ls

In order to have a look at the whole file, we can use the less command. Once we run the following statement, we will be able to use the arrow keys to move up and down in the file one line at a time (or the j and k keys); the page up and page down keys to jump by pages (or the f and b keys); and the forward slash key to search for something (try typing /giant for example and then press the n key to see the next match). Press the q key to exit from viewing the file with less.

less -N stmtn10.txt

Trimming the header and footer

In the above case, we used the option -N to tell the less command that we wanted it to include line numbers at the beginning of each line. (Try running the less command without that option to see the difference.) Using the line numbers, we can see that the Project Gutenberg header runs from Line 1 to Line 40 inclusive, and that the footer runs from Line 2206 to Line 2525 inclusive. To create a copy of the text that has the header and footer removed, we can use the Linux stream editor sed. We have to start with the footer, because if we removed the header first it would change the line numbering for the rest of the file.

sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt

This command tells sed to delete all of the material between lines 2206 and 2525 and output the results to a file called stmtn10-nofooter.txt. You can use less to confirm that this new file still contains the Project Gutenberg header but not the footer. We can now trim the header from this file to create another version with no header or footer. We will call this file stmtn10-trimmed.txt. Use less to confirm that it looks the way it should. While you are using less to view a file, you can use the g key to jump to the top of the file and the shift-g to jump to the bottom.

sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt

Use the ls command to confirm that you now have four files: stmtn10-backup.txt, stmtn10-nofooter.txt, stmtn10-trimmed.txt and stmtn10.txt.

A few basic statistics

We can use the wc command to find out how many lines (-l option) and how many characters (-m) our file has. Running the following shows us that the answer is 2165 lines and 121038 characters.

wc -l stmtn10-trimmed.txt
wc -m stmtn10-trimmed.txt

Finding patterns

Linux has a very powerful pattern-matching command called grep, which we will use frequently. At its most basic, grep returns lines in a file which match a pattern. The command below shows us lines which contain the word giant. The -n option asks grep to include line numbers. Note that this pattern is case sensitive, and will not match Giant.

grep -n "giant" stmtn10-trimmed.txt
1115:Do you believe in giants? No, do you say? Well, listen to my story,
1138:to admit that to do it needed a giant's strength, and so they deserve
1214:giants think of doing. We have not long to wait before we shall see, and

What if we wanted to find both capitalized and lowercase versions of the word? In the following command, we tell grep that we want to use an extended set of possible patterns (the -E option) and show us line numbers (the -n option). The pattern itself says to match something that starts either with a capital G or a lowercase g, followed by lowercase iant.

grep -E -n "(G|g)iant" stmtn10-trimmed.txt

Creating a standardized version of the text

When we are analyzing the words in a text, it is usually convenient to create a standardized version that eliminates whitespace and punctuation and converts all characters to lowercase. We will use the tr command to translate and delete characters of our trimmed text, to create a standardized version. First we delete all punctuation, using the -d option and a special pattern which matches punctuation characters. Note that in this case the tr command requires that we use the redirection operators to specify both the input file (<) and the output file (>). You can use the less command to confirm that the punctuation has been removed.

tr -d [:punct:] < stmtn10-trimmed.txt > stmtn10-nopunct.txt

The next step is to use tr to convert all characters to lowercase. Once again, use the less command to confirm that the changes have been made.

tr [:upper:] [:lower:] < stmtn10-nopunct.txt > stmtn10-lowercase.txt

Finally, we will use the tr command to convert all of the Windows CRLF line endings to the LF line endings that characterize Linux and OS X files. If we don’t do this, the spurious carriage return characters will interfere with our frequency counts.

tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt

Counting word frequencies

The first step in counting word frequencies is to use the tr command to translate each blank space into an end-of-line character (or newline, represented by \n). This gives us a file where each word is on its own line. Confirm this using the less or head command on stmtn10-oneword.txt.

tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt

The next step is to sort that file so the words are in alphabetical order, and so that if a given word appears a number of times, these are listed one after another. Once again, use the less command to look at the resulting file. Note that there are many blank lines at the beginning of this file, but if you page down you start to see the words: a lot of copies of a, followed by one copy of abashed, one of ability, and so on.

sort stmtn10-oneword.txt > stmtn10-onewordsort.txt

Now we use the uniq command with the -c option to count the number of repetitions of each line. This will give us a file where the words are listed alphabetically, each preceded by its frequency. We use the head command to look at the first few lines of our word frequency file.

uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt
head stmtn10-wordfreq.txt
    358
      1 1861
      1 1865
      1 1888
      1 1894
    426 a
      1 abashed
      1 ability
      4 able
     44 about

Pipelines

When using the tr command, we saw that it is possible to tell a Linux command where it is getting its input from and where it is sending its output to. It is also possible to arrange commands in a pipeline so that the output of one stage feeds into the input of the next. To do this, we use the pipe operator (|). For example, we can create a pipeline to go from our lowercase file (with Linux LF endings) to word frequencies directly, as shown below. This way we don’t create a bunch of intermediate files if we don’t want to. You can use the less command to confirm that stmtn10-wordfreq.txt and stmtn10-wordfreq2.txt look the same.

tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt

When we use less to look at one of our word frequency files, we can search for a particular term with the forward slash. Trying /giant, for example, shows us that there are sixteen instances of the word giants in our text. Spend some time exploring the original text and the word frequency file with less.
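Since each line of the frequency file is just a count followed by a word, you can also pull out particular entries with grep rather than paging through the whole file, for example:

egrep " giants?$" stmtn10-wordfreq.txt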