
Here are some links for Spring 2019 talks on computational history that I gave at the Fields Institute and MIT.

Sites that can be used with no prior programming experience:

If you are comfortable with scripting:

Technical sources:


  • Brugger, The Archived Web (2019)
  • Hartog, Regimes of Historicity (2016)
  • Milligan, History in the Age of Abundance? (2019)
  • Snyder, The Road to Unfreedom (2018)
  • Tooze, Crashed (2018)


In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.

My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurrence function.) As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising, I decided to share it so that other people might get some use out of it (or suggest improvements).

In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.
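
The comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not the code I used: the three thresholds are invented placeholders that would need tuning for each book, and the sketch assumes that each page's JPEG and ALTO file share a stem (for example 000123.jpg and 000123.xml), which may not match the actual layout of the collection.

```python
from pathlib import Path


def classify_page(jpeg_bytes, alto_bytes,
                  blank_jpeg=50_000, blank_alto=2_000, ratio_cutoff=0.05):
    """Guess the page type from the two file sizes alone.

    All three thresholds are hypothetical and should be tuned per book.
    """
    # Small image and almost no OCR output: probably a blank page.
    if jpeg_bytes < blank_jpeg and alto_bytes < blank_alto:
        return "blank"
    # A lot of OCR text relative to the image size suggests a page of text.
    if alto_bytes / max(jpeg_bytes, 1) >= ratio_cutoff:
        return "text"
    # Otherwise the JPEG is large relative to the ALTO file.
    return "image"


def survey_book(book_dir):
    """Pair each JPEG with the ALTO file of the same stem and classify it."""
    results = {}
    for jpeg in sorted(Path(book_dir).glob("*.jpg")):
        alto = jpeg.with_suffix(".xml")  # assumed naming scheme
        if alto.exists():
            results[jpeg.stem] = classify_page(jpeg.stat().st_size,
                                               alto.stat().st_size)
    return results
```

Plotting the two sizes against each other for a single book is the quickest way to see where sensible cutoffs fall before trusting any fixed numbers.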

The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster, pages with images tend to cluster, and blank pages (and covers) fall out along one axis because they have no text at all. We can use more sophisticated image processing and machine learning to further subdivide images and extract them once they are located, but this seems pretty good for a first pass.

Let me wrap up this series of posts by suggesting that the most important aspect of the method is to practice what is called kaizen in Japanese: make continuous small improvements.  Part of this is simply a willingness to keep tweaking something even while it is working.  The other part is an ability to measure the effects of the changes that you do make.  When I started looking at the word counts for my writing from day-to-day, I realized that the kind of music that I was listening to made a difference.  So I started systematically buying songs at iTunes and seeing what kind of impact they had.  When I found something that increased my productivity, I used the ‘Genius’ feature to find related songs and added them to my playlist.  (iTunes: “Want to tell your friends about this?  Connect with them on Ping.”  Me: “Are you crazy?  My friends would mock me until I cried.  I can barely admit to myself that I spent 99 cents on that song.”)  But if listening to terrible music allows me to write 14% more each day, I have one extra day each week to work on something else.

Programmers use the verb refactor to describe the process of taking apart something that is working, optimizing the pieces, and putting it back together.  For a while I used DevonThink to organize all of my notes, but I realized that the features that made it work so well for a focused research project also made it too cumbersome to handle the minutiae of day-to-day life.  So I’m now using OmniFocus (and Getting Things Done) to keep track of the things that I have to do, and using Evernote to collect random notes on everything that is not part of an existing research project or course that I’m teaching.  ProfHacker has a number of good posts about things that you can do with Evernote (1, 2, 3).  For me, the main advantages are that it is lightweight, cloud-based and accessible from almost every computing device I own.

So far we’ve discussed creating a backup system, a local archive of OCRed digital sources and a bibliographic database.  We’ve also covered a number of strategies for adding new material to our local repository automatically, including search, information trapping, and spidering.  Two programs lie at the core of the workflow, DevonThink Pro and Scrivener.  When I add a new source to my local repository, I create a bibliographic record for it, then I index it in DevonThink. Once it is indexed I can use the searching, clustering and concordance tools in DevonThink to explore my sources.  Since everything that is related to my project is indexed in DT, it is the one local place where I go to look for information.  Rather than explain all of the useful things that can be done with your sources in DT (and there are a lot) I will simply refer to the articles by Steven Johnson and Chad Black that convinced me to give DT a try:

A couple of notes about DT.  If you do decide to buy the software, buy the $10 e-book on how to use the software at the same time.  There are a lot of powerful features and this is the fastest way to learn them.  Unlike Johnson, I store everything related to my project in DT.  That is why I advise bursting large documents into smaller units for better searching and clustering.  As a historian, I also tend to write books that have more chronological structure than his do, so I use a sheet in DevonThink to create a chronology that can be sorted by date, name, event, and source.  It is not as flexible as a spreadsheet, but it does not have to be.

For writing, I am using Scrivener, which allows me to draft and rearrange small units of text easily.  I can copy a passage that I’ve just written from Scrivener into DT and use the magic hat to find anything in my sources that may be related to that paragraph (just as Johnson describes).  The super-secret monograph consists of seven chapters of five subsections each.  In Scrivener, I can see the status of each of those pieces at a glance: how much has been written, how much needs to be done, and how each relates to other subsections.  Rather than facing a yawning gulf at the bottom of a word processor screen, and potential writer’s block, I can see the book take shape as I work on it.  When the draft is finished in Scrivener, it is easily compiled to whatever form the publisher wants.  I can’t say enough good things about Scrivener.  By itself I’m sure it has doubled my writing speed.

The key is working with small enough units of text.  When you cite a source, you are typically referring to a part of it: a quote, a paragraph, a passage, an argument.  Similarly, when you write a book, you can only work on a small part of it at one time.  Inappropriate tools force you to manipulate objects at the wrong scale, usually the whole document, however long.  Better tools, such as DT and Scrivener, allow you to focus on exactly the pieces you need for the task at hand.

If you are collecting local copies of documents that you haven’t read yet, you need to use computer programs to search through and organize them.  Since all text is OCRed, it is easy enough to use Spotlight to do the searching (or a program like DevonThink: more on this in a future post).  The problem is that it does not do you much good to find out that your search term(s) appear in a book-length document.  If you have a lot of reference books in your local digital archive, you will get many hits for common search terms.  If you have to open each reference and look at each match, you are wasting your time.  Admittedly, it is faster than consulting the index of a paper book and flipping to the relevant section.  The process can be greatly sped up, however.

The trick is to take long PDFs and “burst” them into single pages.  That way, when you search for terms in your local archive, your results are in the form

  • Big Reference Book, p 72
  • Interesting Article, p 7
  • Interesting Article, p 2
  • Another Article, p 49
  • Big Reference Book, p 70
  • Interesting Article, p 1

rather than

  • Interesting Article (1, 2, 7)
  • Big Reference Book (70, 72)
  • Another Article (49)

The first advantage is that you don’t have to open each individual source (like Big Reference Book) and, in effect, repeat your search to find all instances of your search terms within the document.  The second advantage is that the search algorithm determines the relevance of each particular page to your search terms, rather than the source as a whole.  So the few valuable hits in a long, otherwise irrelevant document are surfaced in a convenient manner.  Sources that are on-topic (like “Interesting Article” in the example above), will show up over and over in your results list.  If you decide you want to read the whole source, you can open the original PDF rather than opening the individual page files.
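
To make the difference concrete, here is a minimal sketch of a page-level search. It assumes the archive is already available as a mapping from source title to a list of page texts (a hypothetical structure for illustration), and it uses a raw count of the search term as a crude stand-in for relevance. Spotlight and DevonThink use their own ranking methods, which this does not reproduce.

```python
def search_pages(archive, term):
    """Return (source, page_number, hit_count) for every page containing term.

    `archive` maps a source title to a list of page texts. Results are
    ordered with the most hits first.
    """
    term = term.lower()
    hits = []
    for source, pages in archive.items():
        for number, text in enumerate(pages, start=1):
            count = text.lower().count(term)
            if count:
                hits.append((source, number, count))
    # Most hits first; ties are broken by source title and page number.
    hits.sort(key=lambda hit: (-hit[2], hit[0], hit[1]))
    return hits
```

Because each page is scored separately, one relevant page in a long reference book can outrank a short article that mentions the term only in passing.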

It is easy enough to write a short script to burst PDFs if you know how to program.  If you don’t, however, you can use an Automator workflow to accomplish the same results.  If you’d like, you can even turn your bursting workflow into a droplet.  I’ve attached a sample workflow to get you started.
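
For those who do program, a bursting script might look something like the following sketch. It assumes the third-party pypdf library is installed, and the output naming scheme (source name, then a zero-padded page number) is my own choice for illustration rather than anything the Automator workflow produces.

```python
from pathlib import Path


def page_filename(stem, page_number, total_pages):
    """Build a zero-padded name such as 'big_reference_book_p072.pdf'."""
    width = max(3, len(str(total_pages)))
    return f"{stem}_p{page_number:0{width}d}.pdf"


def burst_pdf(pdf_path, out_dir):
    """Write each page of pdf_path to its own single-page PDF in out_dir."""
    # Imported here so the naming helper works even without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(str(pdf_path))
    total = len(reader.pages)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # one page per output file
        target = out_dir / page_filename(pdf_path.stem, number, total)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Zero-padding the page numbers keeps the single-page files in reading order when a folder is sorted by name.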

Once you start collecting large numbers of digital sources by searching for them or using an information trapping strategy, you will find that you are often in the position of wanting to download a lot of files from a given site.  Obviously you can click on the links one at a time to make local copies, but that loses one of the main advantages of working with computers–letting them do your work for you.  Instead, you should use a program like DownThemAll (a Firefox extension), SiteSucker (a standalone program) or GNU wget (a command line tool that you call from the terminal).  Each of these automates what would otherwise be a very repetitive and thankless task.  If you have never tried this before, start by getting comfortable with one of the more user-friendly alternatives like DownThemAll or SiteSucker, then move to wget when you need more power.  You can also write your own custom programs to harvest sources, of course.  There is an introductory lesson on this in the Programming Historian.

Now imagine that you have a harvester that is not only capable of downloading files, but of extracting the hyperlinks found on each page, navigating to the new pages and downloading those also.  This kind of program is called a spider, crawler or bot.  Search engine companies make extensive use of web spiders to create a constantly updated (and constantly out-of-date), partial map of the entire internet.  For research, it is really nice to be able to spider more limited regions of the web in search of quality sources.  Again, it is not too difficult to write your own spiders (see Kevin Hemenway and Tara Calishain’s Spidering Hacks) but there are also off-the-shelf tools which will do some spidering for you.  In addition to writing my own spiders, I’ve used a number of these packages.  Here I will describe DevonAgent.

DevonAgent includes a web browser with some harvesting capabilities built in.  You can describe what you are looking for with a rich set of operators like “(Industrial BEFORE Revolution) AND (India NEAR/10 Textile*)”.  Results are displayed with both relevance ranking and with an interactive graphical topic map that you can use to navigate.  You can keep your results in an internal archive or export them easily to DevonThink.  (More on this in a future post).  You can schedule searches to run automatically, thus extending your information trapping strategy.  DevonAgent also has what are called “scanners”: filters that recognize particular kinds of information.  You can search, for example, for pages that contain PDFs, e-mail addresses, audio files, spreadsheets or webcams.  You can also specify which URLs your spider will visit (including password protected sites).  DevonAgent comes with about 80 plugins for search engines and databases, including sites like Google Scholar, IngentaConnect, the Internet Archive and Project Gutenberg.  You can also write your own plugins in XML.

DevonAgent allows you to follow links automatically and to set the depth of the search.  If you were looking at this blog post with DevonAgent, a Level 1 search would also retrieve the pages linked to by this post (for DownThemAll, etc.), it would retrieve some other pages from my blog, and so on.  A Level 2 search would retrieve all of the stuff that a Level 1 search gets, plus the pages that the DownThemAll page links to, the pages linked to in some of my other blog posts, and so on.  Since a spider is a program that you run while you are doing something else, it is OK if it goes down a lot of blind alleys in order to find things that are relatively rare.  Learning where to start and how to tune the depth of your search is an essential part of using spidering for your research.  DevonAgent will use Growl to notify you when its search is complete.  (If there is something that I am eagerly awaiting, I also use Prowl to get Growl notifications when I’m away from my computer.  But you may find that’s too much of a good thing.)
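
If you would like to see what search depth means in code, here is a minimal sketch of a depth-limited spider that uses only the Python standard library. It is not how DevonAgent works internally, and it is deliberately bare: there is no politeness delay, no robots.txt check and no restriction on which sites are visited, all of which a real spider should have. The function names are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def fetch(url):
    """Download a page as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, depth, fetch_page=fetch):
    """Breadth-first crawl returning {url: html}.

    Depth 0 is the start page only, depth 1 adds the pages it links to,
    depth 2 adds the pages those link to, and so on.
    """
    pages = {}
    frontier = [start_url]
    for level in range(depth + 1):
        next_frontier = []
        for url in frontier:
            if url in pages:
                continue  # already visited
            try:
                html = fetch_page(url)
            except Exception:
                continue  # a blind alley; skip it and carry on
            pages[url] = html
            if level < depth:
                next_frontier.extend(extract_links(html, url))
        frontier = next_frontier
    return pages
```

The number of pages can grow very quickly with each extra level, which is why choosing the starting point and the depth matters so much.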

If you are just getting started with online research, there are some things that are handy to know, and a few tools you might like to set up for yourself.

Analog and digital.  When I talk to my students about the difference between analog and digital representations, I use the example of two clocks.  The first is the kind that has hour and minute hands, and perhaps one for seconds, too.  At some point you learned how to tell time on an analog clock, and it may have seemed difficult.  Since the clock takes on every value in between two times, telling time involves a process of measurement. You say, “it’s about 3:15,” but the time changes continuously as you do so.  Telling time with a digital clock, by contrast, doesn’t require you to do more than read the value on the display.  It is 3:15 until the clock says it is 3:16.  Digital representations can only take on one of a limited (although perhaps very large) number of states.  Not every digital representation is electronic.  Writing is digital, too, in the sense that there are only a finite number of characters, and instances of each are usually interchangeable.  You can print a book in a larger font, or in Braille, without changing the meaning.  Salvador Dalí’s melting clocks, however, would keep different time–which was the point, of course.

The costs are different.  Electronic digital information can be duplicated at near-zero cost, transmitted at the speed of light, stored in infinitesimally small volumes, and created, processed and consumed by machines.  This means that ideas that were more-or-less serviceable in the world before networked computers–ideas about value, property rights, communication, creativity, intelligence, governance and many other aspects of society and culture–are now up for debate.  The emergence of new rights regimes (such as open access, open content and open source) and the explosion of new information are manifestations of these changing costs.

You won’t be able to read everything.  Estimates of the amount of new information that is now created annually are staggering (2003, 2009).  As you become more skilled at finding online sources, you will discover that new material on your topic appears online much faster than you can read it.  The longer you work on something, the more behind you will get.  This is OK, because everyone faces this issue whether they realize it or not.  In traditional scholarship, scarcity was the problem: travel to archives was expensive, access to elite libraries was gated, resources were difficult to find, and so on.  In digital scholarship, abundance is the problem.  What is worth your attention or your trust?

Assume that what you want is out there, and that you simply need to locate it.  I first found this advice in Thomas Mann’s excellent Oxford Guide to Library Research.  Although Mann’s book focuses primarily on pre-digital scholarship, his strategies for finding sources are more relevant than ever.  Don’t assume that you are the best person for the job, either.  Ask a librarian for help.  You’ll find that they tend to be nicer, better informed, more helpful and more tech savvy than the people you usually talk to about your work.  Librarians work constantly on your behalf to solve problems related to finding and accessing information.

The first online tool you should master is the search engine.  The vast majority of people think that they can type a word or two into Google and choose something from the first page of millions of results.  If they don’t see what they’re looking for, they try a different keyword or give up.  When I talk to scholars who aren’t familiar with digital research, their first assumption is often that there aren’t any good online resources for their subject.  A little bit of guided digging often shows them that this is far from the truth.  So how do you use search engines more effectively?  First of all, most search sites have an advanced search page that lets you focus in on your topic, exclude search terms, weight some terms more than others, limit your results to particular kinds of document, to particular sites, to date ranges, and so on. Second, different search engines introduce different kinds of bias by ranking results differently, so you get a better view when you routinely use more than one.  Third, all of the major search engines keep introducing new features, so you have to keep learning new skills.  Research technique is something you practice, not something you have.

Links are the currency of the web.  Links make it possible to navigate from one context to another with a single click.  For human users, this greatly lowers the transaction costs of comparing sources.  The link is abstract enough to serve as means of navigation and able to subsume traditional scholarly activities like footnoting, citation, glossing and so on.  Furthermore, extensive hyperlinking allows readers to follow nonlinear and branching paths through texts.  What many people don’t realize is that links are constantly being navigated by a host of artificial users, colorfully known as spiders, bots or crawlers. A computer program downloads a webpage, extracts all of the links and other content on it, and follows each new link in turn, downloading the pages that it encounters along the way.  This is where search engine results come from: the ceaseless activity of millions of automated computer programs that constantly remake a dynamic and incomplete map of the web.  It has to be this way, because there is no central authority.  Anyone can add stuff to the web or remove it without consulting anyone else.

The web is not structured like a ball of spaghetti.  Research done with web spiders has shown that a lot of the most interesting information to be gleaned from digital sources lies in the hyperlinks leading into and out of various nodes, whether personal pages, documents, archives, institutions, or what have you. Search engines provide some rudimentary tools for mapping these connections, but much more can be learned with more specialized tools.  Some of the most interesting structure is to be found in social networks, because…

Emphasis is shifting from a web of pages to a web of people.  Sites like Blogger, WordPress, Twitter, Facebook, Flickr and YouTube put the emphasis on the contributions of individual people and their relationships to one another.  Social searching and social recommendation tools allow you to find out what your friends or colleagues are reading, writing, thinking about, watching, or listening to.  By sharing information that other people find useful, individuals develop reputation and change their own position in social networks.  Some people are bridges between different communities, some are hubs in particular fields, and many are lonely little singletons with one friend.  This is very different from the broadcast world, where there were a few huge hubs spewing to a thronging multitude (known as “the audience”).

Ready to jump in?  Here are some things you might try.

Customize a browser to make it more useful for research

  1. Install Firefox
  2. Add some search extensions
    1. Worldcat
    2. Internet Archive
    3. Project Gutenberg
    4. Merriam-Webster
  3. Pull down the search icon and choose “Manage Search Engines” -> “Get more search engines” then search for add-ons within Search Tools
  4. Try search refinement
    1. Install the Deeper Web add-on and try using tag clouds to refine your search
    2. Example.  Suppose you are trying to find out more about a nineteenth-century missionary named Adam Elliot.  If you try a basic Google search, you will soon discover there is an Australian animator with the same name.  Try using Deeper Web to find pages on the missionary.
  5. Add bookmarks for advanced search pages to the bookmark toolbar
    1. Google Books (note that you can set to “full view only” or “full view and preview”)
    2. Historic News (experiment with the timeline view)
    3. Internet Archive
    4. Hathi Trust
    5. Flickr Commons (try limiting by license)
    6. Wolfram Alpha (try “China population”)
    7. Google Ngram Viewer
  6. You can block sites that have low quality results

Work with citations

  1. Install Zotero
    1. Try grabbing a record from the Amazon database
    2. Use Zotero to make a note of stuff you find using Search Inside on your book
    3. Under the info panel in Zotero, use the Locate button to find a copy in a library near you
    4. From the WorldCat page, try looking for related works and saving a whole folder of records in Zotero
    5. Find the book in a library and save the catalog information as a page snapshot
    6. Go into Google Books, search for full view only, download metadata and attach PDF
  2. Learn more about Zotero
  3. Explore other options for bibliographic work (e.g. Mendeley)

Find repositories of digital sources on your topic

Here are a few examples for various kinds of historical research.

  1. Canada
    2. Dictionary of Canadian Biography online
    3. Repertory of Primary Source Databases
    4. Our Roots
    5. British Columbia Digital Library
    6. Centre for Contemporary Canadian Art
  2. US
    1. Library of Congress Digital Collections
    2. American Memory
    3. NARA
    4. Calisphere
    5. NYPL Digital Projects
  3. UK and Europe
    1. Perseus Digital Library
    2. UK National Archives
    3. Old Bailey Online
    4. London Lives
    5. Connected Histories
    6. British History Online
    7. British Newspapers 1800-1900
    8. English Broadside Ballad Archive
    9. EuroDocs
    10. Europeana
    11. The European Library
    12. Gallica
  4. Thematic (e.g., history of science and medicine)
    1. Complete Works of Charles Darwin Online
    2. Darwin Correspondence Project
    3. National Library of Medicine Digital Projects

Capture some RSS feeds

  1. Install the Sage RSS feed reader in Firefox (lots of other possibilities, like Google Reader)
    1. Go to H-Net and choose one or more lists to monitor; drag the RSS icons to the Sage panel and edit properties
    2. Go to Google News and create a search that you’d like to monitor; drag RSS to Sage panel and edit properties
    3. Go to Hot New Releases in Expedition and Discovery at Amazon and subscribe to RSS feed
  2. Consider signing up for Twitter, following a lot of colleagues and aggregating their interests with Tweeted Times (this is how Digital Humanities Now works)

Discover some new tools

  1. The best place to start is Lisa Spiro’s Getting Started in the Digital Humanities and Digital Research Tools wiki

The digital world is a world of abundance.  The most scarce resource in your research process is always going to be your own time.  Ideally, then, you only want to pay attention to those things which absolutely require it–everything else should be handed off to computer programs.  The digital world is also a plastic world, where anything can change without warning.  Bookmarks are of little value, and the last thing you want to do is engage in a long and fruitless online search for something that you know you found once.

What this means in practical terms is that whenever you look at something, you want your computer to have a copy of what you’ve seen.  If you need to look at a source again to cite it, or quote it, or reread it in the light of new information, that source should be instantly ready-to-hand.  You can accomplish this by always saving a local, digital copy of everything you read, along with basic metadata: who created it? where did you find it? when did you look at it?

If you read something on paper, digitize and OCR it, then add the metadata to a bibliographic database.  There are many options here.  You can use Zotero in the Firefox browser on any platform.  On the Mac, Sente, Bookends and Mendeley are all popular choices.  (I’ve used all of these in my own research.  Each has advantages and disadvantages, so you might have to try a few before settling on something that really works for you.  The main point is that you need a bibliographic database, so choose one).  If you look at something online, download a copy and OCR it if necessary.  If you are browsing the web, take a screenshot of the information you want to keep, and OCR it.  The Mac already has excellent built-in screen capturing abilities, but I also use LittleSnapper because it gives me more flexibility in what I save.

I like to keep all of my local copies of documents as PDFs because the format keeps the text and image layer of the document together.  I use a consistent file naming convention so that I know what a source should be called if I already have a copy of it.  Dates in filenames are always YYYYMMDD, so the files get sorted in chronological order when I look at the contents of a folder.  I don’t use capital letters, spaces, or any punctuation other than dot, dash and underscore in my filenames.  This makes it easier to write programs to process whole batches of files.  (It is possible to write a program to parse any filename, of course, but who needs extra hassle?)
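
As an illustration, here is a small Python sketch that builds and checks names under these rules. The rules themselves are the ones described above. Putting the date first, joining the parts with underscores and replacing disallowed characters with dashes are assumptions I have made for the example, not a description of my exact scheme.

```python
import re
from datetime import date

# Lowercase letters, digits, dot, dash and underscore only.
VALID_NAME = re.compile(r"[a-z0-9._-]+")


def is_valid_name(filename):
    """True if filename uses only the allowed characters."""
    return VALID_NAME.fullmatch(filename) is not None


def make_name(when, *parts, extension="pdf"):
    """Build 'YYYYMMDD_part_part.ext' from a date and descriptive parts."""
    stamp = f"{when.year:04d}{when.month:02d}{when.day:02d}"
    cleaned = []
    for part in parts:
        # Lowercase, then squeeze each run of disallowed characters to a dash.
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            cleaned.append(slug)
    return "_".join([stamp] + cleaned) + "." + extension
```

With the date written as YYYYMMDD at the front of the name, an ordinary alphabetical sort of a folder is also a chronological one.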

Although I do spend a lot of time thinking about my research, I also do other things like sleep, eat, go for walks, read books, teach, and so on.  Fortunately, my research continues even when my attention is directed elsewhere.  If I were working with traditional analog sources, the only way to accomplish this would be to hire human research assistants.  They are expensive, need to eat and sleep etc., and might have better things to do than work on my project.  Since I am working with digital sources, however, I can employ as many computational processes as I’d like to help me.  It’s even possible to create computational processes that can make additional copies of themselves to help with a heavy workload–but that is an advanced technique that I won’t get into here.

This is one area where knowing how to program can make a really big difference in your research, because you can create programs to help you with any task that you can specify clearly.  In my own research I am doing a lot of custom programming, but my goal here is to lay out a method that doesn’t require any programming.

Usually when you do a search, you skim through the results and read or bookmark whatever interests you.  That is fine if you are looking for something in particular, but what do you do if you want to keep up with the news on a topic, or create a steady stream of results on a regular basis?  Tara Calishain calls this process Information Trapping.  The basic idea is to use focused searching to create RSS feeds, and then to monitor these feeds with a feed reader or feed aggregator.  You can use a desktop application to read your feeds (like NetNewsWire on the Mac) or an online service like Google Reader.  I have both kinds of reader set up, but for my ‘super-secret’ monograph I am actually reading the feeds with a different program, as I will describe in a future post.

The major search engines like Google and Yahoo! provide mechanisms for creating RSS feeds from searches.  Once you have refined a particular search for something that you want to monitor (hint: always use the advanced search page), you can subscribe to a feed for that search, and your reader will let you know whenever it finds anything new.
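
A feed reader does all of this for you, but the core of the idea is simple enough to sketch in Python. The example below assumes a plain RSS 2.0 feed (not Atom) that has already been downloaded as text, and the function names are my own. It remembers which items it has seen and reports only the new ones.

```python
import xml.etree.ElementTree as ET


def parse_items(rss_text):
    """Return (key, title) for each <item> in an RSS 2.0 feed.

    The key is the item's guid if it has one, otherwise its link.
    """
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        key = item.findtext("guid", default="").strip() or link
        items.append((key, title))
    return items


def new_items(rss_text, seen):
    """Return titles not seen before, recording their keys in `seen`."""
    fresh = []
    for key, title in parse_items(rss_text):
        if key not in seen:
            seen.add(key)
            fresh.append(title)
    return fresh
```

Running new_items on a schedule against the same set of seen keys gives you a trap that only reports what has appeared since the last check.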

Two additional techniques can make this strategy even more powerful.  First of all, it is possible to use a service like Feed43 to create RSS feeds for any webpage.  This has a bit more of a learning curve, but it allows you to monitor anything on the web.  Second, Yahoo! Pipes provides a mechanism for combining RSS feeds with other kinds of computational processing.  Again, there is a learning curve, but it’s well worth it.  See Marshall Kirkpatrick’s “5 Minute Intro to Yahoo Pipes” to get started.

The most important advantage of working with digital representations is that they can be computationally manipulated.  Nothing that happens on the internet would be possible if this were not the case.  Traditional analog sources, on the other hand, can really only be interpreted and used by people.  The upshot of this is that you will get the most benefit from your research if you make it a habit to digitize any source immediately if it is not already in digital form.

For my “super-secret” monograph project, I am working entirely with digital sources.  Since I am a historian, only a relatively small proportion of these were “born digital”.  These are sources that were created by or with digital devices.  Digital photographs and videos, e-mail messages, SMS texts, tweets, computer games, code and server logs are all examples of born digital sources.  They make up an increasing proportion of all of the information in the world.  The bulk of my sources actually began life as analog objects like books, letters, charts and film photographs, and were subsequently digitized by someone else.  In the 1980s and 90s, it seemed as if most traditional sources might never be digitized.  In the last seven years, however, Google alone has digitized more than a tenth of all of the books in the world.  At this point, my money is on the eventual digitization of absolutely everything.  So why not pitch in?

I still read physical papers and books, but instead of underlining them or writing notes on paper, I use an IRISPen to scan the quotes that I am going to want to access later.  The pen scanner is faster than typing notes, even if you are a fast typist.  If I think that the whole source will be useful (more on this in future posts) then I will use a standard desktop scanner to scan whole pages at a time.  If I’m in an archive, I use whatever combination I can of handheld computer, laptop computer, digital camera and flatbed scanner.  When I have research assistants, I ask them to scan sources for me rather than photocopy them.

For documents that are not born digital, the initial phase of digitization only creates digital pictures of text.  Optical character recognition (OCR) is the next crucial step.  I use the full version of Adobe Acrobat (Pro, not the free Reader) to add a text layer to any digital document that does not already have one.  This is useful not only for documents that you digitize yourself, but also for documents that you download from the internet.  Many sites use OCR on their documents to create a text layer for searching purposes, then strip out that layer from the PDFs that they make available for download.  You can use Acrobat Pro to re-scan the document and add the text layer back in, and you will want to make it a habit to do that.