Recap

In a previous post, we used wget to download a Project Gutenberg ebook from the Internet Archive, then cleaned up the file using the sed and tr commands. The code below collects all of the commands we used: first the download, then a single cleanup pipeline.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
cat stmtn10.txt | tr -d '\r' | sed '2206,2525d' | sed '1,40d' > stmtn10-trimmedlf.txt

Using a pipeline of commands that we have already seen, we can also create a list of words in our ebook, one per line, sorted alphabetically. In English, the command below says “send the stmtn10-trimmedlf.txt file into a pipeline that translates uppercase characters into lowercase, translates hyphens into blank spaces, translates apostrophes into blank spaces, deletes all other punctuation, puts one word per line, sorts the words alphabetically, removes all duplicates and writes the resulting wordlist to a file called stmtn10-wordlist.txt”.

cat stmtn10-trimmedlf.txt | tr '[:upper:]' '[:lower:]' | tr '-' ' ' | tr "'" " " | tr -d '[:punct:]' | tr ' ' '\n' | sort | uniq > stmtn10-wordlist.txt

Note that we have to use double quotes around the tr expression that contains a single quote (i.e., apostrophe), so that the shell does not get confused about the arguments we are providing. The bracketed character classes like [:upper:] are quoted, too, so that the shell does not try to expand them as filename patterns. Use less to explore stmtn10-wordlist.txt.
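If you want a quick sense of how large the vocabulary is, wc can count the lines in the wordlist, since there is one word per line. This assumes the file was created by the pipeline above.

wc -l stmtn10-wordlist.txt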

Dictionaries

Typically when you install Linux, at least one natural language dictionary is installed. Each dictionary is simply a text file that contains an alphabetical listing of ‘words’ in the language, one per line. The dictionaries are used for programs that do spell checking, but they are also a nice resource that can be used for text mining and other tasks. You will find them in the folder /usr/share/dict. I chose American English as my language when I installed Linux, so I have a dictionary called /usr/share/dict/american-english.
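To see what is available on your own system, you can list the directory; the exact filenames will vary with your distribution and the languages you chose at install time.

ls /usr/share/dict
wc -l /usr/share/dict/american-english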

Suppose you are reading through a handwritten document and you come across a word that begins with an s, has two or three characters that you can’t make out, and ends in th.  You can use grep to search for the pattern in your dictionary to get some suggestions.

grep -E "^s.{2,3}th$" /usr/share/dict/american-english

The computer responds with the following.

saith
sheath
sixth
sleuth
sloth
smith
smooth
sooth
south
swath

In the statement above, the caret (^) and dollar sign ($) stand for the beginning and end of the line respectively. Since each line in the dictionary file consists of a single word, we get whole words back. The dot (.) stands for any single character, and the pair of numbers in curly braces ({n,m}) says we want at least n and at most m repetitions of the preceding pattern (here, the dot).
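To get a feel for the quantifier, try widening the range and comparing the results; exactly which new words appear will depend on your dictionary.

grep -E "^s.{1,4}th$" /usr/share/dict/american-english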

Linux actually has a family of grep commands that correspond to common options. There is a command called egrep, for example, which is equivalent to using grep -E, to match an extended set of patterns. There is a command called fgrep, which is a fast way to search for fixed strings (rather than patterns). We will use both egrep and fgrep in examples below. As with any Linux command, you can learn more about command line options with man.
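As a quick illustration of the difference, the first two commands below should return exactly the same matches, while fgrep treats the dot as a literal character rather than a pattern, and so will probably return nothing from a dictionary file.

egrep "^s.{2,3}th$" /usr/share/dict/american-english
grep -E "^s.{2,3}th$" /usr/share/dict/american-english
fgrep "s.th" /usr/share/dict/american-english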

Words in our text that aren’t in the dictionary

One way to use a dictionary for text mining is to get a list of words that appear in the text but are not listed in the dictionary. We can do this using the fgrep command, as shown below. In English, the command says “using the file /usr/share/dict/american-english as a source of strings to match (-f option), find the whole-word matches (-w option) in stmtn10-wordlist.txt, invert the match to keep the words that are not in the list (-v option), and send the results to the file stmtn10-nondictwords.txt”.

fgrep -v -w -f /usr/share/dict/american-english stmtn10-wordlist.txt > stmtn10-nondictwords.txt

Use the less command to explore stmtn10-nondictwords.txt. Note that it contains years (1861, 1865), proper names (alice, andrews, charles), toponyms (america, boston, calcutta) and British / Canadian spellings (centre, fibres). It also includes a lot of specialized vocabulary that gives us some sense of what this text may be about: coccinea (a kind of plant), coraltown, cornfield, ferny, flatheads, goshawk, hepaticas (another plant), pitchy, quercus (yet another plant), seaweeds, and so on. Two interesting ‘words’ in this list are cucuie and ea. Use grep on stmtn10-trimmedlf.txt to figure out what they are.
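If you want a starting point, the commands below show each occurrence in context; the -w option makes sure that ea is only matched as a whole word.

grep --color "cucuie" stmtn10-trimmedlf.txt
grep -w --color "ea" stmtn10-trimmedlf.txt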

Matching patterns within and across words

The grep command and its variants are useful for matching patterns both within a word and across a sequence of words. If we wanted to find all of the examples in our original text that contain an apostrophe s, we would use the command below. Note that the --color option colors the portion of the text that matches our pattern.

egrep --color "'s " stmtn10-trimmedlf.txt

If we wanted to find contractions, we could change the pattern to "'t ", and if we wanted to match both we would use "'. " (this would also match elided spellings like discover’d).
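Spelled out as commands, those two variations look like this; the single quote inside each double-quoted pattern is passed through to egrep unchanged.

egrep --color "'t " stmtn10-trimmedlf.txt
egrep --color "'. " stmtn10-trimmedlf.txt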

We could search for the use of particular kinds of words. Which, for example, contain three vowels in a row?

egrep --color "[aeiou]{3}" stmtn10-trimmedlf.txt

We can also use egrep to search for particular kinds of phrases. For example, we could look for use of the first person pronoun in conjunction with English modal verbs.

egrep --color "I (can|could|dare|may|might|must|need|ought|shall|should|will|would)" stmtn10-trimmedlf.txt

Or we could see which pronouns are used with the word must:

egrep --color "(I|you|he|she|it|they) must" stmtn10-trimmedlf.txt

Spend some time using egrep to search for particular kinds of words and phrases. For example, how would you find regular past tense verbs (ones that end in -ed)? Years? Questions? Quotations?
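As a nudge for one of these, a four-digit year can be matched with a repeated digit class; the other patterns are left for you to work out.

egrep --color "[0-9]{4}" stmtn10-trimmedlf.txt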

Keywords in context

If you are interested in studying how particular words are used in a text, it is usually a good idea to build a concordance. At the Linux command line, this can be done easily using the ptx command, which builds a permuted term index. The command below uses the -f option to fold lowercase to uppercase for sorting purposes, and the -w option to set the width of our output to 50 characters.

ptx -f -w 50 stmtn10-trimmedlf.txt > stmtn10-ptx.txt

The output is stored in the file stmtn10-ptx.txt, which you can explore with less or search with grep.

If we want to find the word ‘giant’, for example, we might start with the following command. The -i option tells egrep to ignore case, so we get uppercase, lowercase and mixed case results.

egrep -i "[[:alpha:]]   giant" stmtn10-ptx.txt

Note that the word ‘giant’ occurs in many of the index entries. By preceding it with any alphabetic character, followed by three blank spaces, we see only those entries where ‘giant’ is the keyword in context. (Try grepping stmtn10-ptx.txt for the pattern "giant" to see what I mean.)
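For comparison, here is the unanchored version, which matches every index line that mentions the word anywhere:

egrep -i --color "giant" stmtn10-ptx.txt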

As a more detailed example, we might try grepping through our permuted term index to see if the author uses gendered pronouns differently.  Start by creating two files of pronouns in context.

egrep -i "[[:alpha:]]   (he|him|his) " stmtn10-ptx.txt > stmtn10-male.txt
egrep -i "[[:alpha:]]   (she|her|hers) " stmtn10-ptx.txt > stmtn10-female.txt

Use less to page through each file. We can also search both files together for interesting patterns; conveniently, the glob *male* matches both stmtn10-male.txt and stmtn10-female.txt, since “female” contains “male”. If we type in the following command

cat *male* | egrep "   (he|she) .*ed"

we find “she died” and “she needs” versus “he toiled”, “he sighed”, “he flapped”, “he worked”, “he lifted”, “he dared”, “he lived”, “he pushed”, “he wanted” and “he packed”.