If you are collecting local copies of documents that you haven’t read yet, you need computer programs to search through and organize them.  Since all of your text has been OCRed, it is easy enough to use Spotlight to do the searching (or a program like DevonThink: more on this in a future post).  The problem is that it does not do you much good to learn that your search terms appear somewhere in a book-length document.  If you have a lot of reference books in your local digital archive, you will get many hits for common search terms, and opening each reference to inspect every match wastes your time.  Admittedly, it is still faster than consulting the index of a paper book and flipping to the relevant section, but the process can be sped up considerably.

The trick is to take long PDFs and “burst” them into single pages.  That way, when you search for terms in your local archive, your results are in the form

  • Big Reference Book, p 72
  • Interesting Article, p 7
  • Interesting Article, p 2
  • Another Article, p 49
  • Big Reference Book, p 70
  • Interesting Article, p 1

rather than

  • Interesting Article (1, 2, 7)
  • Big Reference Book (70, 72)
  • Another Article (49)

The first advantage is that you don’t have to open each individual source (like Big Reference Book) and, in effect, repeat your search to find all instances of your search terms within the document.  The second advantage is that the search algorithm determines the relevance of each particular page to your search terms, rather than the relevance of the source as a whole, so the few valuable hits in a long, otherwise irrelevant document are surfaced in a convenient way.  Sources that are on-topic (like “Interesting Article” in the example above) will show up over and over in your results list.  If you decide you want to read the whole source, you can open the original PDF rather than the individual page files.

It is easy enough to write a short script to burst PDFs if you know how to program.  If you don’t, however, you can use an Automator workflow to achieve the same result.  If you’d like, you can even turn your bursting workflow into a droplet.  I’ve attached a sample workflow to get you started.
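
If you do decide to script the bursting step yourself, a minimal sketch in Python might look like the following.  It uses the third-party pypdf library; the library choice, the file name and the page-numbering scheme are my illustrative assumptions, not part of the attached Automator workflow.

    # burst_pdf.py -- split a multi-page PDF into single-page files.
    # Requires the third-party pypdf library (pip install pypdf).

    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    def burst(pdf_path, output_dir="burst"):
        """Write each page of pdf_path as its own one-page PDF in output_dir."""
        reader = PdfReader(pdf_path)
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        stem = Path(pdf_path).stem
        for number, page in enumerate(reader.pages, start=1):
            writer = PdfWriter()
            writer.add_page(page)
            with open(out / f"{stem}-p{number:04d}.pdf", "wb") as handle:
                writer.write(handle)

    if __name__ == "__main__":
        burst("big-reference-book.pdf")   # hypothetical example file

Naming each page file after its source and page number means that a hit in your results list still tells you exactly where to look in the original.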

Once you start collecting large numbers of digital sources by searching for them or using an information trapping strategy, you will find that you are often in the position of wanting to download a lot of files from a given site.  Obviously you can click on the links one at a time to make local copies, but that loses one of the main advantages of working with computers–letting them do your work for you.  Instead, you should use a program like DownThemAll (a Firefox extension), SiteSucker (a standalone program) or GNU wget (a command line tool that you call from the terminal).  Each of these automates what would otherwise be a very repetitive and thankless task.  If you have never tried this before, start by getting comfortable with one of the more user-friendly alternatives like DownThemAll or SiteSucker, then move to wget when you need more power.  You can also write your own custom programs to harvest sources, of course.  There is an introductory lesson on this in the Programming Historian.
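
For those who eventually want to move beyond the off-the-shelf tools, the downloading step itself is not mysterious.  Here is a minimal sketch in Python using only the standard library; the URLs and the download folder are hypothetical placeholders, and this is not how DownThemAll, SiteSucker or wget work internally.

    # fetch_files.py -- download a batch of files from a list of URLs.
    # Standard library only; the URLs below are hypothetical placeholders.

    import time
    import urllib.request
    from pathlib import Path

    URLS = [
        "https://example.org/reports/report-1871.pdf",
        "https://example.org/reports/report-1872.pdf",
    ]

    download_dir = Path("downloads")
    download_dir.mkdir(exist_ok=True)

    for url in URLS:
        target = download_dir / url.rsplit("/", 1)[-1]
        if target.exists():              # skip files you already have
            continue
        print("fetching", url)
        urllib.request.urlretrieve(url, str(target))
        time.sleep(1)                    # pause between requests to be polite

The pause between requests matters: an automated harvester can hammer a server far faster than a human clicking links ever could.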

Now imagine that you have a harvester that is not only capable of downloading files, but also of extracting the hyperlinks found on each page, navigating to the new pages and downloading those as well.  This kind of program is called a spider, crawler or bot.  Search engine companies make extensive use of web spiders to create a constantly updated (and constantly out-of-date) partial map of the entire internet.  For research, it is really nice to be able to spider more limited regions of the web in search of quality sources.  Again, it is not too difficult to write your own spiders (see Kevin Hemenway and Tara Calishain’s Spidering Hacks), but there are also off-the-shelf tools that will do some spidering for you.  In addition to writing my own spiders, I’ve used a number of these packages.  Here I will describe DevonAgent.
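
To make the idea concrete, here is a toy spider in Python, written with the standard library rather than any off-the-shelf package.  The starting URL and the depth limit are arbitrary placeholders, and a real spider would also respect robots.txt and throttle its requests.

    # mini_spider.py -- follow links outward from a starting page, up to a fixed depth.
    # A bare-bones illustration only: no robots.txt handling, no rate limiting.

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect the href attribute of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_depth=1):
        seen = set()
        frontier = [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except Exception as error:
                print("skipped", url, error)
                continue
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                frontier.append((urljoin(url, link), depth + 1))
        return seen

    if __name__ == "__main__":
        pages = crawl("https://example.org/blog/", max_depth=1)   # hypothetical start
        print(len(pages), "pages visited")

The max_depth parameter corresponds roughly to the “Level” setting in DevonAgent, described below.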

DevonAgent includes a web browser with some harvesting capabilities built in.  You can describe what you are looking for with a rich set of operators like “(Industrial BEFORE Revolution) AND (India NEAR/10 Textile*)”.  Results are displayed with both relevance ranking and with an interactive graphical topic map that you can use to navigate.  You can keep your results in an internal archive or export them easily to DevonThink.  (More on this in a future post).  You can schedule searches to run automatically, thus extending your information trapping strategy.  DevonAgent also has what are called “scanners”: filters that recognize particular kinds of information.  You can search, for example, for pages that contain PDFs, e-mail addresses, audio files, spreadsheets or webcams.  You can also specify which URLs your spider will visit (including password protected sites).  DevonAgent comes with about 80 plugins for search engines and databases, including sites like Google Scholar, IngentaConnect, the Internet Archive and Project Gutenberg.  You can also write your own plugins in XML.

DevonAgent allows you to follow links automatically and to set the depth of the search.  If you were looking at this blog post with DevonAgent, a Level 1 search would also retrieve the pages linked to by this post (for DownThemAll, etc.), some other pages from my blog, and so on.  A Level 2 search would retrieve everything that a Level 1 search gets, plus the pages that the DownThemAll page links to, the pages linked to in some of my other blog posts, and so on.  Since a spider is a program that you run while you are doing something else, it is OK if it goes down a lot of blind alleys in order to find things that are relatively rare.  Learning where to start and how to tune the depth of your search is an essential part of using spidering for your research.  DevonAgent will use Growl to notify you when its search is complete.  (If there is something that I am eagerly awaiting, I also use Prowl to get Growl notifications when I’m away from my computer.  But you may find that’s too much of a good thing.)


If you are just getting started with online research, there are some things that are handy to know, and a few tools you might like to set up for yourself.

Analog and digital.  When I talk to my students about the difference between analog and digital representations, I use the example of two clocks.  The first is the kind that has hour and minute hands, and perhaps one for seconds, too.  At some point you learned how to tell time on an analog clock, and it may have seemed difficult.  Since the clock takes on every value in between two times, telling time involves a process of measurement. You say, “it’s about 3:15,” but the time changes continuously as you do so.  Telling time with a digital clock, by contrast, doesn’t require you to do more than read the value on the display.  It is 3:15 until the clock says it is 3:16.  Digital representations can only take on one of a limited (although perhaps very large) number of states.  Not every digital representation is electronic.  Writing is digital, too, in the sense that there are only a finite number of characters, and instances of each are usually interchangeable.  You can print a book in a larger font, or in Braille, without changing the meaning.  Salvador Dalí’s melting clocks, however, would keep different time–which was the point, of course.

The costs are different.  Electronic digital information can be duplicated at near-zero cost, transmitted at the speed of light, stored in infinitesimally small volumes, and created, processed and consumed by machines.  This means that ideas that were more-or-less serviceable in the world before networked computers–ideas about value, property rights, communication, creativity, intelligence, governance and many other aspects of society and culture–are now up for debate.  The emergence of new rights regimes (such as open access, open content and open source) and the explosion of new information are manifestations of these changing costs.

You won’t be able to read everything.  Estimates of the amount of new information that is now created annually are staggering (2003, 2009).  As you become more skilled at finding online sources, you will discover that new material on your topic appears online much faster than you can read it.  The longer you work on something, the further behind you will get.  This is OK, because everyone faces this issue whether they realize it or not.  In traditional scholarship, scarcity was the problem: travel to archives was expensive, access to elite libraries was gated, resources were difficult to find, and so on.  In digital scholarship, abundance is the problem.  What is worth your attention or your trust?

Assume that what you want is out there, and that you simply need to locate it.  I first found this advice in Thomas Mann’s excellent Oxford Guide to Library Research.  Although Mann’s book focuses primarily on pre-digital scholarship, his strategies for finding sources are more relevant than ever.  Don’t assume that you are the best person for the job, either.  Ask a librarian for help.  You’ll find that they tend to be nicer, better informed, more helpful and more tech savvy than the people you usually talk to about your work.  Librarians work constantly on your behalf to solve problems related to finding and accessing information.

The first online tool you should master is the search engine.  The vast majority of people think that they can type a word or two into Google and choose something from the first page of millions of results.  If they don’t see what they’re looking for, they try a different keyword or give up.  When I talk to scholars who aren’t familiar with digital research, their first assumption is often that there aren’t any good online resources for their subject.  A little bit of guided digging often shows them that this is far from the truth.  So how do you use search engines more effectively?  First of all, most search engines have an advanced search page that lets you focus on your topic, exclude terms, weight some terms more than others, and limit your results to particular kinds of documents, to particular sites, to date ranges, and so on.  Second, different search engines introduce different kinds of bias by ranking results differently, so you get a better view when you routinely use more than one.  Third, all of the major search engines keep introducing new features, so you have to keep learning new skills.  Research technique is something you practice, not something you have.

Links are the currency of the web.  Links make it possible to navigate from one context to another with a single click.  For human users, this greatly lowers the transaction costs of comparing sources.  The link is abstract enough to serve as a means of navigation and to subsume traditional scholarly activities like footnoting, citation, glossing and so on.  Furthermore, extensive hyperlinking allows readers to follow nonlinear and branching paths through texts.  What many people don’t realize is that links are constantly being navigated by a host of artificial users, colorfully known as spiders, bots or crawlers.  A computer program downloads a webpage, extracts all of the links and other content on it, and follows each new link in turn, downloading the pages that it encounters along the way.  This is where search engine results come from: the ceaseless activity of millions of automated computer programs that constantly remake a dynamic and incomplete map of the web.  It has to be this way, because there is no central authority.  Anyone can add stuff to the web or remove it without consulting anyone else.

The web is not structured like a ball of spaghetti.  Research done with web spiders has shown that a lot of the most interesting information to be gleaned from digital sources lies in the hyperlinks leading into and out of various nodes, whether personal pages, documents, archives, institutions, or what have you. Search engines provide some rudimentary tools for mapping these connections, but much more can be learned with more specialized tools.  Some of the most interesting structure is to be found in social networks, because…

Emphasis is shifting from a web of pages to a web of people.  Sites like Blogger, WordPress, Twitter, Facebook, Flickr and YouTube put the emphasis on the contributions of individual people and their relationships to one another.  Social searching and social recommendation tools allow you to find out what your friends or colleagues are reading, writing, thinking about, watching, or listening to.  By sharing information that other people find useful, individuals develop reputation and change their own position in social networks.  Some people are bridges between different communities, some are hubs in particular fields, and many are lonely little singletons with one friend.  This is very different from the broadcast world, where there were a few huge hubs spewing to a thronging multitude (known as “the audience”).

Ready to jump in?  Here are some things you might try.

Customize a browser to make it more useful for research

  1. Install Firefox
  2. Add some search extensions
    1. Worldcat
    2. Internet Archive
    3. Project Gutenberg
    4. Merriam-Webster
  3. Pull down the search icon and choose “Manage Search Engines” -> “Get more search engines”, then search for add-ons within Search Tools
  4. Try search refinement
    1. Install the Deeper Web add-on and try using tag clouds to refine your search
    2. Example.  Suppose you are trying to find out more about a nineteenth-century missionary named Adam Elliot.  If you try a basic Google search, you will soon discover there is an Australian animator with the same name.  Try using Deeper Web to find pages on the missionary.
  5. Add bookmarks for advanced search pages to the bookmark toolbar
    1. Google Books (note that you can set it to “full view only” or “full view and preview”)
    2. Historic News (experiment with the timeline view)
    3. Internet Archive
    4. Hathi Trust
    5. Flickr Commons (try limiting by license)
    6. Wolfram Alpha (try “China population”)
    7. Google Ngram Viewer
  6. Block sites that have low-quality results

Work with citations

  1. Install Zotero
    1. Try grabbing a record from the Amazon database
    2. Use Zotero to make a note of stuff you find using Search Inside on your book
    3. Under the info panel in Zotero, use the Locate button to find a copy in a library near you
    4. From the WorldCat page, try looking for related works and saving a whole folder of records in Zotero
    5. Find the book in a library and save the catalog information as a page snapshot
    6. Go into Google Books, search for full view only, download metadata and attach PDF
  2. Learn more about Zotero
  3. Explore other options for bibliographic work (e.g. Mendeley)

Find repositories of digital sources on your topic

Here are a few examples for various kinds of historical research.

  1. Canada
    1. Canadiana.org
    2. Dictionary of Canadian Biography online
    3. Repertory of Primary Source Databases
    4. Our Roots
    5. British Columbia Digital Library
    6. Centre for Contemporary Canadian Art
  2. US
    1. Library of Congress Digital Collections
    2. American Memory
    3. NARA
    4. Calisphere
    5. NYPL Digital Projects
  3. UK and Europe
    1. Perseus Digital Library
    2. UK National Archives
    3. Old Bailey Online
    4. London Lives
    5. Connected Histories
    6. British History Online
    7. British Newspapers 1800-1900
    8. English Broadside Ballad Archive
    9. EuroDocs
    10. Europeana
    11. The European Library
    12. Gallica
  4. Thematic (e.g., history of science and medicine)
    1. Complete Works of Charles Darwin Online
    2. Darwin Correspondence Project
    3. National Library of Medicine Digital Projects

Capture some RSS feeds

  1. Install the Sage RSS feed reader in Firefox (lots of other possibilities, like Google Reader)
    1. Go to H-Net and choose one or more lists to monitor; drag the RSS icons to the Sage panel and edit properties
    2. Go to Google News and create a search that you’d like to monitor; drag RSS to Sage panel and edit properties
    3. Go to Hot New Releases in Expedition and Discovery at Amazon and subscribe to RSS feed
  2. Consider signing up for Twitter, following a lot of colleagues and aggregating their interests with Tweeted Times (this is how Digital Humanities Now works)

Discover some new tools

  1. The best place to start is Lisa Spiro’s Getting Started in the Digital Humanities and Digital Research Tools wiki

The digital world is a world of abundance.  The scarcest resource in your research process is always going to be your own time.  Ideally, then, you only want to pay attention to those things which absolutely require it–everything else should be handed off to computer programs.  The digital world is also a plastic world, where anything can change without warning.  Bookmarks are of little value, and the last thing you want to do is engage in a long and fruitless online search for something that you know you found once.

What this means in practical terms is that whenever you look at something, you want your computer to have a copy of what you’ve seen.  If you need to look at a source again to cite it, or quote it, or reread it in the light of new information, that source should be instantly ready-to-hand.  You can accomplish this by always saving a local, digital copy of everything you read, along with basic metadata: who created it? where did you find it? when did you look at it?
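
If you want to see how little machinery this requires, here is a minimal sketch in Python that saves a copy of a file alongside a small JSON “sidecar” recording that basic metadata.  The field names, the folder layout and the example values are my own illustrative conventions, not part of any of the tools discussed below.

    # save_with_metadata.py -- keep a local copy of a source plus a JSON sidecar
    # recording who created it, where it was found, and when it was looked at.
    # Field names, folder layout and example values are illustrative only.

    import json
    import shutil
    from datetime import date
    from pathlib import Path

    def archive_copy(source_file, creator, found_at, archive_dir="archive"):
        archive = Path(archive_dir)
        archive.mkdir(exist_ok=True)
        stored = archive / Path(source_file).name
        shutil.copy2(source_file, stored)                 # the local copy itself
        sidecar = stored.parent / (stored.name + ".json")
        metadata = {
            "creator": creator,                           # who created it?
            "found_at": found_at,                         # where did you find it?
            "accessed": date.today().isoformat(),         # when did you look at it?
        }
        sidecar.write_text(json.dumps(metadata, indent=2))
        return stored

    if __name__ == "__main__":
        # Hypothetical example file and source URL.
        archive_copy("report-1871.pdf", "Department of Indian Affairs",
                     "https://example.org/reports/report-1871.pdf")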

If you read something on paper, digitize and OCR it, then add the metadata to a bibliographic database.  There are many options here.  You can use Zotero in the Firefox browser on any platform.  On the Mac, Sente, Bookends and Mendeley are all popular choices.  (I’ve used all of these in my own research.  Each has advantages and disadvantages, so you might have to try a few before settling on something that really works for you.  The main point is that you need a bibliographic database, so choose one).  If you look at something online, download a copy and OCR it if necessary.  If you are browsing the web, take a screenshot of the information you want to keep, and OCR it.  The Mac already has excellent built-in screen capturing abilities, but I also use LittleSnapper because it gives me more flexibility in what I save.

I like to keep all of my local copies of documents as PDFs because the format keeps the text and image layer of the document together.  I use a consistent file naming convention so that I know what a source should be called if I already have a copy of it.  Dates in filenames are always YYYYMMDD, so the files get sorted in chronological order when I look at the contents of a folder.  I don’t use capital letters, spaces, or any punctuation other than dot, dash and underscore in my filenames.  This makes it easier to write programs to process whole batches of files.  (It is possible to write a program to parse any filename, of course, but who needs extra hassle?)
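
As a small illustration of what the naming convention buys you, the sketch below pulls the YYYYMMDD date out of each filename and lists a folder chronologically.  It assumes the date comes first in the name, and the example filename in the comment is hypothetical.

    # list_by_date.py -- sort local copies chronologically using a YYYYMMDD prefix.
    # Assumes filenames like "18710612_annual-report_indian-affairs.pdf":
    # lowercase, no spaces, only dot, dash and underscore.

    import re
    from pathlib import Path

    DATE_PATTERN = re.compile(r"^(\d{8})_")   # eight digits at the start of the name

    def dated_files(folder):
        """Yield (date_string, path) pairs for files that follow the convention."""
        for path in Path(folder).iterdir():
            match = DATE_PATTERN.match(path.name)
            if match:
                yield match.group(1), path

    if __name__ == "__main__":
        for date_string, path in sorted(dated_files("archive")):
            print(date_string, path.name)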

Although I do spend a lot of time thinking about my research, I also do other things like sleep, eat, go for walks, read books, teach, and so on.  Fortunately, my research continues even when my attention is directed elsewhere.  If I were working with traditional analog sources, the only way to accomplish this would be to hire human research assistants.  They are expensive, need to eat and sleep etc., and might have better things to do than work on my project.  Since I am working with digital sources, however, I can employ as many computational processes as I’d like to help me.  It’s even possible to create computational processes that can make additional copies of themselves to help with a heavy workload–but that is an advanced technique that I won’t get into here.

This is one area where knowing how to program can make a really big difference in your research, because you can create programs to help you with any task that you can specify clearly.  In my own research I am doing a lot of custom programming, but my goal here is to lay out a method that doesn’t require any programming.

Usually when you do a search, you skim through the results and read or bookmark whatever interests you.  That is fine if you are looking for something in particular, but what do you do if you want to keep up with the news on a topic, or create a steady stream of results on a regular basis?  Tara Calishain calls this process Information Trapping.  The basic idea is to use focused searching to create RSS feeds, and then to monitor these feeds with a feed reader or feed aggregator.  You can use a desktop application to read your feeds (like NetNewsWire on the Mac) or an online service like Google Reader.  I have both kinds of reader set up, but for my ‘super-secret’ monograph I am actually reading the feeds with a different program, as I will describe in a future post.

The major search engines like Google and Yahoo! provide mechanisms for creating RSS feeds from searches.  Once you have refined a particular search for something that you want to monitor (hint: always use the advanced search page), you can subscribe to a feed for that search, and your reader will let you know whenever it finds anything new.
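
If you are curious what a feed reader is doing on your behalf, here is a minimal sketch of the monitoring step in Python, using the third-party feedparser library.  The feed URL is a placeholder, and the “seen items” bookkeeping is my own convention rather than anything the search engines provide.

    # watch_feed.py -- check an RSS feed for items you haven't seen before.
    # Requires the third-party feedparser library (pip install feedparser).
    # The feed URL and the seen-items file are hypothetical placeholders.

    import json
    from pathlib import Path

    import feedparser

    FEED_URL = "https://example.org/search.rss"
    SEEN_FILE = Path("seen_items.json")

    def check_feed():
        seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            key = entry.get("id") or entry.get("link")
            if key and key not in seen:
                print("NEW:", entry.get("title", "(untitled)"), entry.get("link", ""))
                seen.add(key)
        SEEN_FILE.write_text(json.dumps(sorted(seen)))

    if __name__ == "__main__":
        check_feed()   # run by hand or on a schedule (e.g., cron or launchd)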

Two additional techniques can make this strategy even more powerful.  First of all, it is possible to use a service like Feed43 to create RSS feeds for any webpage.  This has a bit more of a learning curve, but it allows you to monitor anything on the web.  Second, Yahoo! Pipes provides a mechanism for combining RSS feeds with other kinds of computational processing.  Again, there is a learning curve, but it’s well worth it.  See Marshall Kirkpatrick’s “5 Minute Intro to Yahoo Pipes” to get started.
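
If you would rather not rely on a hosted service, the core idea behind Pipes, merging several feeds and filtering the result, can also be sketched by hand.  The feed URLs and keywords below are placeholders, and this is not how Pipes itself works internally.

    # merge_and_filter.py -- combine several RSS feeds and keep only matching items,
    # a hand-rolled version of the merge-and-filter idea behind Yahoo! Pipes.
    # Requires feedparser; feed URLs and keywords are hypothetical placeholders.

    import feedparser

    FEEDS = [
        "https://example.org/h-net.rss",
        "https://example.org/news-search.rss",
    ]
    KEYWORDS = ("missionary", "textile", "industrial")

    def merged_matches():
        for url in FEEDS:
            for entry in feedparser.parse(url).entries:
                text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
                if any(keyword in text for keyword in KEYWORDS):
                    yield entry.get("title", "(untitled)"), entry.get("link", "")

    if __name__ == "__main__":
        for title, link in merged_matches():
            print(title, link)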