In previous posts, we looked at a variety of Linux command line techniques for analyzing text and finding patterns in it, including word frequencies, permuted term indexes, regular expressions, simple search engines and named entity recognition. In this post we focus on a preliminary issue: converting images of texts into text files that we can work with. Starting with digital photographs or scans of documents, we can apply optical character recognition (OCR) to create machine-readable texts. These will certainly have some errors, but the quality tends to be surprisingly good for clean scans of recently typed or printed pages. Older fonts and texts, or warped, indistinct or blurry page images often result in lower quality OCR.
Using a window manager
As with earlier posts, we are going to use command line tools to process our files. When working with page images, however, it is very useful to be able to see pictures. The standard Linux console does not have this facility, so we need to use a window manager or a GUI desktop environment. The former is a lightweight application that allows you to view and manipulate multiple windows at the same time; the latter is a full-blown interface to your operating system that includes graphical versions of your applications. Sometimes you use a mouse with a window manager, but most of your interactions continue to be at the command line. With a GUI desktop, the expectation is that you will spend most of your time using a mouse for interaction (this is very familiar to users of Windows or OS X). In Linux, you can choose from a variety of window managers and desktop environments.
[UPDATE 2014. If you are using the HistoryCrawler virtual machine, you already have the KDE GUI installed. Skip the “Installation” section and go directly to “Viewing Images of Text”.]
If you are working with a Linux distribution that does not already have a window manager or desktop environment installed, you will need one. Here I will be using a window manager called Openbox, but most of the commands should work fine with other Linux configurations. Try
man openbox
If you don’t get a man page, you can install X Windows and Openbox with the following.
sudo aptitude install xorg xserver-xorg xterm
sudo aptitude install openbox obconf obmenu
Next we will want some command-line image processing software to manipulate page images. Check to see if ImageMagick is installed with
man imagemagick
If you don’t get a man page, install it with
sudo aptitude install imagemagick
We will install three other utilities for working with files: zip / unzip, pandoc and tre-agrep. Check for man pages…
man zip
man pandoc
man tre-agrep
… and install if necessary.
sudo aptitude install zip unzip
sudo aptitude install pandoc
sudo aptitude install tre-agrep
Finally, we want to install Tesseract, the program which performs the OCR. Check to see if it is already installed with
man tesseract
If not, install with the following commands
sudo aptitude install tesseract-ocr tesseract-ocr-eng
Note that I am only installing the English language OCR package here. If you want to install additional natural languages, see the Tesseract web site for further instructions.
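If you would rather check for everything in one pass, something like the following sketch might do. It looks for the commands on your PATH instead of consulting man pages. The function name missing_tools is my own invention, and keep in mind that command names are not always the same as package names (the tesseract-ocr package, for example, provides the tesseract command).

```shell
# Print the name of every listed command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The commands used in this post (ImageMagick provides convert, display and identify).
missing_tools openbox convert display identify zip unzip pandoc tre-agrep tesseract
```

If nothing is printed, everything on the list was found.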
We start our window manager with
startx
Once it is running, the background will turn dark grey and you will see a mouse pointer. Right-click to get a menu and choose “Terminal emulator”. This will give us the terminal that we will use for our commands.
Viewing images of text
The source that we will be working with is the same one that we used in the previous post. It is a collection of scanned correspondence between Frank N. Meyer and his superior at the US Department of Agriculture, relating to an expedition to South China between 1916 and 1918. We start by making a directory for the source and downloading XML metadata, OCRed text and a zipped directory of page images (in JPEG 2000 format). This will take a few minutes to complete.
mkdir meyer
cd meyer
wget -r -H -nc -nd -np -nH --cut-dirs=2 -e robots=off -l1 -A .xml,.txt,.zip 'http://archive.org/download/CAT10662165MeyerSouthChinaExplorations/'
Next we unzip the directory of page images and remove the zipped file.
unzip CAT*zip
rm CAT*zip
mv CAT31091908_jp2 jp2files
Let’s take the hundredth page of our source as a sample image to work with. We can look at the JPEG 2000 image of the page using the ImageMagick display command. Since the image is very high resolution, we scale it to 25% of the original size and even then we can only see a small amount of it in the window. We have to scroll around with the mouse to look at it. Click the X in the upper right hand corner of the display window to dismiss it. Note that since we don’t want to tie up our terminal while the display command is operating, we run it in the background by adding an ampersand to the command line.
cp jp2files/CAT31091908_0100.jp2 samplepg.jp2
display -geometry 25%x25% samplepg.jp2 &
We can create a smaller version of the page image with the ImageMagick convert command. This version is much easier to read on screen.
convert samplepg.jp2 -resize 25% samplepg-25.jpg
display samplepg-25.jpg &
Optical character recognition
We can open the OCRed text from the Internet Archive with
less CAT31091908_djvu.txt
Use /49 and n in the less display to search for the page in question. Note how good the OCR is on the first part of that page, confusing only the 2 and comma in the date “June 29, 1917”. Skipping ahead, we see a few other errors: “T realize” for “I realize”, “v/hat” for “what”, and the like. For convenience, let’s yank the OCRed text out of the file to a separate file. We use less -N to find the line numbers of the beginning and end of the OCRed page.
less -N CAT31091908_djvu.txt
cat CAT31091908_djvu.txt | sed '5837,5896!d' > samplepg_djvu.txt
less samplepg_djvu.txt
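The sed expression '5837,5896!d' deletes every line that is not in the given range. An equivalent way to write this, sketched below with a hypothetical helper that I have called extract_lines, is to suppress sed's default output with -n and print only the range. The ten-line file here is a stand-in, not the real OCR text.

```shell
# Print lines START through END (inclusive) of FILE.
# Usage: extract_lines START END FILE
extract_lines() {
    sed -n "${1},${2}p" "$3"
}

# Small demonstration on a ten-line stand-in file.
tmp=$(mktemp)
seq 1 10 > "$tmp"
extract_lines 4 6 "$tmp"    # prints 4, 5 and 6, one per line
rm -f "$tmp"
```

With the real file, this would be something like extract_lines 5837 5896 CAT31091908_djvu.txt > samplepg_djvu.txt.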
Since we already have a good text file for this document, we don’t really need to do OCR on the page images. If we did not have the text, however, we could create our own with Tesseract. Since Tesseract does not work with JPEG 2000 images, we first use ImageMagick to create a greyscale TIF file. Try the following commands.
convert samplepg.jp2 -colorspace Gray samplepg-gray100.tif
tesseract samplepg-gray100.tif samplepg_tesseract
less samplepg_tesseract.txt
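If we wanted to OCR every page image rather than just one, we might wrap the same two commands in a loop. The following is only a sketch: the function names run and ocr_pages and the DRY_RUN variable are my own, and the output file names (one .tif and one .txt next to each .jp2) are an assumption about how you might want things laid out. Setting DRY_RUN=1 prints the commands instead of running them, which is a cheap way to check the loop before committing to a long job.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Convert every .jp2 file in a directory to greyscale TIF and OCR it.
# Usage: ocr_pages DIRECTORY
ocr_pages() {
    dir=$1
    for jp2 in "$dir"/*.jp2; do
        [ -e "$jp2" ] || continue    # skip if the glob matched nothing
        base=${jp2%.jp2}             # strip the .jp2 extension
        run convert "$jp2" -colorspace Gray "$base.tif"
        run tesseract "$base.tif" "$base"
    done
}

# Example (dry run): DRY_RUN=1 ocr_pages jp2files
```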
Note that this OCR is really good, too, and that the few errors that occurred are different from the ones in the DJVU OCR. We can use the diff command to find the parts of the two OCR files that do not overlap. The options which we provide to diff here cause it to ignore blank lines and whitespace, and to report only on those lines which differ from one file to the next. You can learn more about the output of the diff command by consulting its man page.
diff -b -B -w --suppress-common-lines samplepg_djvu.txt samplepg_tesseract.txt | less
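To see what those options do without wading through a whole page, here is a toy comparison of two made-up snippets (they are not taken from the real OCR files). The second snippet lacks a blank line, has extra spaces in the date, and has one OCR-style error. The wrapper function ocr_diff is a hypothetical convenience of mine; it is there because diff exits with status 1 whenever the files differ, which scripts that stop on errors would otherwise treat as a failure.

```shell
# Compare two OCR files, ignoring blank lines and whitespace differences.
ocr_diff() {
    status=0
    diff -b -B -w --suppress-common-lines "$1" "$2" || status=$?
    # diff exits with 0 when the files match and 1 when they differ;
    # anything higher signals a real problem (such as a missing file).
    [ "$status" -le 1 ]
}

tmp=$(mktemp -d)
printf 'Dear Sir:\n\nJune 29, 1917\nI realize\n' > "$tmp/a.txt"
printf 'Dear Sir:\nJune 29,   1917\nT realize\n' > "$tmp/b.txt"
ocr_diff "$tmp/a.txt" "$tmp/b.txt"
rm -r "$tmp"
```

The missing blank line and the extra spaces should be passed over, leaving the “I realize” / “T realize” pair as the difference worth looking at.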
A trickier case for OCR
A clean, high resolution scan of a page of printed text is the best-case scenario for OCR. If you do archival work, you may have a lot of digital photos of documents that are rotated, warped, unevenly lit, blurry, or partially obscured by fingers. The documents themselves may be photocopies, mimeograph pages, dot-matrix printouts, or something even more obscure. In cases like these, you have to decide how much time you want to spend cleaning up your page images. If you have a hundred of them, and each is very important to your project, it is worth doing it right. If you have a hundred thousand and you just want to mine them for interesting patterns, something quicker and dirtier will have to suffice.
As an example of a more difficult OCR job, consider this newspaper article about Meyer’s expedition from the Tacoma Times (15 Feb 1910). This comes from the Library of Congress Chronicling America project, a digital archive of historic newspapers that provides JPEG 2000, PDF and OCR text files for every page, neatly laid out in a directory structure that is optimized for automatic processing.
First we download the image and OCR text. When we ask for the latter, we will actually get an HTML page, so we use pandoc to convert that to text. Then we use sed to extract the part of the OCR text that corresponds to our article, and use less to display it. We see that the supplied OCR is pretty rough, but probably contains enough recognizable keywords to be useful for search (e.g., “persimmon”, “Meyer”, “China”, “Taft”, “radishes”, “cabbages”, “kaki”).
wget http://chroniclingamerica.loc.gov/lccn/sn88085187/1910-02-15/ed-1/seq-4.jp2 -O tacoma.jp2
wget http://chroniclingamerica.loc.gov/lccn/sn88085187/1910-02-15/ed-1/seq-4/ocr/ -O tacoma-lococr.html
pandoc -o tacoma-lococr.txt tacoma-lococr.html
cat tacoma-lococr.txt | sed '117,244!d' | sed '55,102d' | tr -d '\\' > tacoma-meyer-lococr.txt
less tacoma-meyer-lococr.txt
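One thing to keep in mind about the pipeline above is that the second sed sees only the output of the first, so its line numbers are relative to the extracted block rather than to the original file. Here is a toy illustration using seq (the numbers are arbitrary, not taken from the newspaper page).

```shell
# Keep lines 3-8 of the input, then delete lines 2-3 of what remains.
# The second sed counts from the start of the first sed's output,
# so it removes the original lines 4 and 5.
seq 1 10 | sed '3,8!d' | sed '2,3d'
# prints 3, 6, 7 and 8, one per line
```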
Before trying to display the JPEG 2000 file, we use the ImageMagick identify command to learn more about it. We see that it is 5362×6862 pixels and 4.6MB in size. That is too big for us to look at easily, but we can use ImageMagick to make a small JPG copy to display.
identify tacoma.jp2
convert tacoma.jp2 -resize 10% tacoma-10.jpg
display tacoma-10.jpg
The Chronicling America site already provides OCR text which is usable for some tasks. It’s not clear if we can do a better job or not. If we wanted to try, we would start by using ImageMagick to extract the region of the JPEG 2000 image that contains the Meyer article, then use Tesseract on that. The sequence of commands below does exactly this. It is not intended to be a tutorial, but rather to suggest how one might use command line tools to begin to figure out a workflow for dealing with tricky OCR cases. In this case, the Tesseract output is not really better than the OCR supplied with the source, although it might be possible to get better results with more image processing. If you find yourself using ImageMagick a lot for this kind of work, you might be interested in the textcleaner script from Fred’s ImageMagick Scripts.
convert -extract 2393x3159+0+4275 tacoma.jp2 tacoma-extract.jp2
convert -extract 1370x2080+100+200 tacoma-extract.jp2 tacoma-meyer.jp2
display tacoma-meyer.jp2 &
convert tacoma-meyer.jp2 tacoma-meyer.tif
tesseract tacoma-meyer.tif tacoma-meyer-tesocr
less tacoma-meyer-tesocr.txt
Assessing OCR quality
Let’s return to the Meyer correspondence. We have an OCR file for the whole document, CAT31091908_djvu.txt. Using techniques that we covered in previous posts, we can create a word list…
cat CAT31091908_djvu.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[:digit:]' | tr ' ' '\n' | sort > CAT31091908_djvu-allwords.txt
uniq CAT31091908_djvu-allwords.txt > CAT31091908_djvu-wordlist.txt
less CAT31091908_djvu-wordlist.txt
… then determine word frequencies and look through them to figure out what the text might be about. We find “letter”, “meyer”, “china”, “pear”, “species”, “seed”, “reimer”, “chinese”, “seeds” and “plants”.
uniq -c CAT31091908_djvu-allwords.txt | sort -n -r > CAT31091908_djvu-wordfreqs.txt
less CAT31091908_djvu-wordfreqs.txt
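If you find yourself repeating these steps for many files, they can be wrapped up as a single filter. This is a sketch rather than a drop-in replacement, and it differs slightly from the commands above: it deletes punctuation and digits in one step, splits on any whitespace rather than only on spaces, drops empty lines, and breaks ties between equally frequent words alphabetically. The name word_freqs is hypothetical.

```shell
# Read text on standard input and print "count word" lines,
# most frequent first.
word_freqs() {
    tr '[:upper:]' '[:lower:]' |
        tr -d '[:punct:][:digit:]' |
        tr -s '[:space:]' '\n' |
        grep -v '^$' |
        sort | uniq -c | sort -k1,1nr -k2
}

# Toy input; with the real file this would be
#   word_freqs < CAT31091908_djvu.txt | less
printf 'Pear seeds, pear trees and 12 more PEAR seeds.\n' | word_freqs
```

On the toy input, “pear” comes out on top with a count of 3, followed by “seeds” with 2.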
We can also find all of the non-dictionary words in our OCR text and study that list to learn more about the errors that may have been introduced.
fgrep -i -v -w -f /usr/share/dict/american-english CAT31091908_djvu-wordlist.txt > CAT31091908_djvu-nondictwords.txt
less CAT31091908_djvu-nondictwords.txt
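To make the logic of that filter a bit more concrete, here is the same idea on a toy scale. The options say: treat each dictionary entry as a fixed string (-F, which is what fgrep does), ignore case (-i), match whole words only (-w), read the patterns from a file (-f), and print the lines that do not match (-v). The three-word dictionary and the helper name nondict_words are made up for illustration.

```shell
# Print the words in WORDLIST that do not appear in DICTIONARY.
# Both files are expected to hold one word per line.
# Usage: nondict_words DICTIONARY WORDLIST
nondict_words() {
    grep -F -i -v -w -f "$1" "$2"
}

tmp=$(mktemp -d)
printf 'pear\nseed\nalmond\n' > "$tmp/dict.txt"
printf 'amygdalus\namykdalus\npear\nseed\n' > "$tmp/words.txt"
nondict_words "$tmp/dict.txt" "$tmp/words.txt"    # prints amygdalus and amykdalus
rm -r "$tmp"
```

Note that a legitimate botanical term and an OCR error both end up on the list, which is why the output still has to be read by a person.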
We see things that look like prefixes and suffixes: “agri”, “ameri”, “alities”, “ation”. This suggests we might want to do something more sophisticated with hyphenation. We see words that may be specialized vocabulary, rather than OCR errors: “amaranthus”, “amygdalus”, “beancheese”. We also see variants of terms which clearly are OCR errors: “amydalus”, “amykdalus”, “amypdalu”.
Approximate pattern matching
When we used pattern matching in the past, we looked for exact matches. But it would be difficult to come up with regular expressions to match the range of possible OCR errors (or spelling mistakes) that we might find in our sources. In a case like this we want to use fuzzy or approximate pattern matching. The tre-agrep command lets us find items that sort of match a pattern. That is, they match a pattern up to some specified number of insertions, deletions and substitutions. We can see this in action by gradually making our match fuzzier and fuzzier. Try the commands below.
tre-agrep -2 --color amygdalus CAT31091908_djvu.txt
tre-agrep -4 --color amygdalus CAT31091908_djvu.txt
With two possible errors, we see matches for “Pyrus amgydali folia”, “AmyKdalus”, “Mnygdalus”, “itoygdalus”, “Araugdalus” and “Amy^dalus”. When we increase the number of possible errors to four, we see more OCR errors (like “Amy^dalus”) but we also begin to get a lot of false positives that span words (with the matches shown here between square brackets): “I [am als]o”, “m[any days]”, “pyr[amidal f]orm”, “orn[amental s]tock”, “hills [and dales]”. If it helps, you can think of fuzzy matching as a signal detection problem: we want to maximize the number of hits while minimizing the number of false positives.
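To get a feel for what “up to some specified number of insertions, deletions and substitutions” means, here is a rough sketch that computes the edit (Levenshtein) distance in awk and keeps the words from a list that are within k edits of a pattern. This is not how tre-agrep is implemented, and unlike tre-agrep it compares whole words, one per line, rather than searching inside lines of running text. The name fuzzy_words is hypothetical, and the comparison is case-sensitive.

```shell
# Read one word per line on standard input and print those whose edit
# distance from PATTERN is at most K.
# Usage: fuzzy_words PATTERN K
fuzzy_words() {
    awk -v pat="$1" -v k="$2" '
    # Classic dynamic-programming edit distance between strings a and b.
    # The extra parameters after b are local variables.
    function dist(a, b,    la, lb, i, j, cost, best, d) {
        la = length(a); lb = length(b)
        for (i = 0; i <= la; i++) d[i, 0] = i
        for (j = 0; j <= lb; j++) d[0, j] = j
        for (i = 1; i <= la; i++) {
            for (j = 1; j <= lb; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                best = d[i-1, j] + 1                                      # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1            # insertion
                if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost  # substitution
                d[i, j] = best
            }
        }
        return d[la, lb]
    }
    { if (dist(pat, $0) <= k) print }'
}

# Toy word list: the first three are within one edit of "amygdalus".
printf 'amygdalus\namykdalus\namydalus\nchinese\n' | fuzzy_words amygdalus 1
```

Because the non-dictionary word list from the previous section is already one word per line, something like fuzzy_words amygdalus 2 < CAT31091908_djvu-nondictwords.txt would be one way to look for OCR variants without the false positives that span words.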