If you are collecting local copies of documents that you haven’t read yet, you need to use computer programs to search through and organize them. Since all of the text has been OCRed, it is easy enough to use Spotlight to do the searching (or a program like DevonThink; more on this in a future post). The problem is that it does not do you much good to find out that your search terms appear somewhere in a book-length document. If you have a lot of reference books in your local digital archive, you will get many hits for common search terms, and if you have to open each reference and look at each match, you are wasting your time. Admittedly, it is faster than consulting the index of a paper book and flipping to the relevant section. The process can be greatly sped up, however.
The trick is to take long PDFs and “burst” them into single pages. That way, when you search for terms in your local archive, your results are in the form:
- Big Reference Book, p 72
- Interesting Article, p 7
- Interesting Article, p 2
- Another Article, p 49
- Big Reference Book, p 70
- Interesting Article, p 1
Grouped by source, those same hits are:
- Interesting Article (1, 2, 7)
- Big Reference Book (70, 72)
- Another Article (49)
The first advantage is that you don’t have to open each individual source (like Big Reference Book) and, in effect, repeat your search to find all instances of your search terms within the document. The second advantage is that the search algorithm determines the relevance of each particular page to your search terms, rather than the relevance of the source as a whole. So the few valuable hits in a long, otherwise irrelevant document are surfaced in a convenient manner. Sources that are on-topic (like “Interesting Article” in the example above) will show up over and over in your results list. If you decide you want to read the whole source, you can open the original PDF rather than opening the individual page files.
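If you want a per-source summary like “Interesting Article (1, 2, 7)”, a few lines of Python can roll page-level hits up by source. This is only a sketch: it assumes the burst page files are named `<source title>_p<page>.pdf`, which is a hypothetical convention chosen for illustration, and the names `split_hit` and `group_hits` are made up here.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention for burst pages:
#   "<source title>_p<page number>.pdf"
PAGE_FILE = re.compile(r"^(?P<source>.+)_p(?P<page>\d+)\.pdf$")


def split_hit(filename: str) -> tuple[str, int]:
    """Recover (source title, page number) from a burst page's file name."""
    match = PAGE_FILE.match(filename)
    if match is None:
        raise ValueError(f"not a burst page file: {filename!r}")
    return match["source"], int(match["page"])


def group_hits(filenames: list[str]) -> dict[str, list[int]]:
    """Collect page-level hits under their source, most hits first."""
    pages = defaultdict(set)
    for name in filenames:
        source, page = split_hit(name)
        pages[source].add(page)
    # Sources with more matching pages come first; ties sort by title.
    ordered = sorted(pages.items(), key=lambda item: (-len(item[1]), item[0]))
    return {source: sorted(found) for source, found in ordered}
```

Sorting by the number of matching pages is a crude stand-in for relevance, but it does put the on-topic sources at the top of the summary.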
It is easy enough to write a short script to burst PDFs if you know how to program. If you don’t, however, you can use an Automator workflow to accomplish the same result. If you’d like, you can even turn your bursting workflow into a droplet. I’ve attached a sample workflow to get you started.
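For those who do program, a bursting script might look something like the following. It is a minimal sketch, not the attached workflow: it assumes the third-party `pypdf` library is installed, and the function names and the `_p0072.pdf` file-naming scheme are my own choices for illustration.

```python
from pathlib import Path


def page_filename(stem: str, page_number: int, total_pages: int) -> str:
    """Name for one burst page, zero-padded so files sort in page order."""
    width = max(4, len(str(total_pages)))
    return f"{stem}_p{page_number:0{width}d}.pdf"


def burst_pdf(source: Path, out_dir: Path) -> list[Path]:
    """Write each page of `source` to its own single-page PDF in `out_dir`."""
    # Imported here so the naming helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    out_dir.mkdir(parents=True, exist_ok=True)
    total = len(reader.pages)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # one page per output file
        target = out_dir / page_filename(source.stem, number, total)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling `burst_pdf(Path("Big Reference Book.pdf"), Path("pages"))` would leave files such as `Big Reference Book_p0072.pdf` in the `pages` folder. Note that the number in the file name is the page's position in the PDF, which may differ from the page number printed in the book.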