In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.
My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurrence function.) As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising, I decided to share it so that other people might get some use out of it (or suggest improvements).
In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.
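As a rough illustration of the comparison described above, here is a minimal Python sketch. It is not the code I used; it assumes that all of a book's pages sit in one directory and that each page's JPEG and ALTO files share a file stem (for example `0001.jpg` and `0001.xml`), which may not match the actual layout of the collection. The function names `page_sizes` and `size_ratio` are made up for this example.

```python
from pathlib import Path


def page_sizes(book_dir):
    """Return (stem, jpeg_bytes, alto_bytes) for each page that has both files."""
    book = Path(book_dir)
    # Assumption: a page's JPEG and ALTO files share the same file stem.
    alto_by_stem = {p.stem: p for p in book.glob("*.xml")}
    rows = []
    for jpeg in sorted(book.glob("*.jpg")):
        alto = alto_by_stem.get(jpeg.stem)
        if alto is None:
            continue  # no ALTO file found for this page image
        rows.append((jpeg.stem, jpeg.stat().st_size, alto.stat().st_size))
    return rows


def size_ratio(jpeg_bytes, alto_bytes):
    """JPEG size divided by ALTO size; larger values suggest more image than text."""
    # Guard against division by zero for an empty ALTO file.
    return jpeg_bytes / max(alto_bytes, 1)
```

Plotting the JPEG size against the ALTO size for every row returned by `page_sizes` gives one point per page, and the ratio offers a single number to sort pages by if a plot is not convenient.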
The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster together, pages with images tend to cluster together as well, and blank pages (and covers) fall out along one axis because they have no text at all. More sophisticated image processing and machine learning could be used to further subdivide the images and extract them once they are located, but this seems pretty good for a first pass.