This page contains workflows for doing research with digital sources.

  • Off-the-Shelf Tools on OS X (ca 2010-11). A collection of useful proprietary programs that work well together. This is a good choice if you have more money than free time, and you want to be up and running within a few weeks. When you have more time or need to accomplish more sophisticated tasks, you can experiment with the virtual machine workflow.
  • Open Source Tools in a Linux Virtual Machine (ca 2014). In this workflow, you build a virtual computer that is customized for, and dedicated to, your research process. This virtual machine runs inside a window on a Windows or Mac computer. You can also use all the same tools on a Linux computer. You can easily back up, share and version virtual machines, and create new ones that are customized to new projects or particular tasks.
  • A Time for Research Distancing (2020). Alan MacEachern and I co-authored this article for Active History during COVID-19, to help historians turning to digital research for the first time.
  • Topic Clustering Toolkit (2023). James C. Caldwell, a student who is working with me, created this Docker-based system for unsupervised clustering of texts. It is based on Apache Solr and Carrot2 Workbench, and runs on Windows, macOS, and Linux. Planned additions include Tesseract OCR, language translation, data visualization, and more.

A Workflow for Digital Research Using Off-the-Shelf Tools

Knowing how to program is crucial for doing advanced research with digital sources. Even so, there are many powerful tools that you can use if you don’t know how to program (yet), and even if you do, it usually isn’t a good idea to reinvent the wheel. The posts below describe a complete workflow for finding, harvesting, clustering, excerpting, and keeping track of digital sources, using programs that run on your own computer.

Going digital. Just getting started? Here are some useful things to know.

Start with a backup and versioning strategy.  If you set up your backup system before you do anything else, you will have more peace of mind, and more freedom to experiment with your process.  You can always undo anything that isn’t working out.

Make everything digital.  Traditional analog sources are only useful to people; digital sources can be manipulated by computers. Get in the habit of digitizing all of the sources that you are working with.

Research 24/7. By making use of RSS feeds, your research process can continue even when you aren’t working.  Computational processes never need to take a break.

Make Local Copies of Everything. Whenever you look at something, you want your computer to have a copy of what you’ve seen. That way, you’ll waste much less time trying to find things again.

Spider to Collect Sources. Harvesters are programs that download all of the files at a particular site; spiders travel from one site to the next, downloading files at your bidding. Both can be used to make your online research more efficient and more powerful.

Burst Documents for Fine-Grained Matching. Searching local documents can be much more efficient if you break them up into individual pages.
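
To illustrate the idea of bursting, here is a minimal Python sketch that splits extracted text into pages. It assumes the text came from a tool that marks page breaks with a form feed character (as `pdftotext` does); the function name `burst_pages` is hypothetical and is not taken from the post.

```python
def burst_pages(text):
    """Split extracted text into pages at form feed characters.

    Assumption: the extraction tool separates pages with a form feed.
    Empty pages are dropped.
    """
    pages = text.split("\f")
    return [page.strip() for page in pages if page.strip()]


sample = "Page one text\fPage two text\f"
for number, page in enumerate(burst_pages(sample), start=1):
    print(number, page)
```

Each page could then be saved as its own file, so that a search hit points to a single page rather than to a whole document.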

Write and Cluster Small Texts. Two programs, DevonThink and Scrivener, lie at the core of the workflow. They allow you to index and search your sources and to write in small, manipulable increments.

Measure and Refactor Constantly. Make continuous small improvements to your methods.

A Workflow for Digital Research Using Open Source Tools in a Debian Linux Virtual Machine

In my graduate digital research methods class, I am pairing this workflow with readings from William E. Shotts, Jr., The Linux Command Line: A Complete Introduction (No Starch Press, 2012). Here is a sample syllabus. Thanks to Devon Elliott, Mary Beth Start and Ian Milligan for suggestions and improvements.

Installing Debian Linux in a Virtual Machine. The first step is to install Oracle VirtualBox and create a Debian Linux machine. (Earlier version: 2013)

Creating the HistoryCrawler VM. HistoryCrawler is a virtual machine that has been customized for doing historical research with digital primary and secondary sources.

Download HistoryCrawler. If you would like to use HistoryCrawler but don’t want to build your own, you can download a copy of the 8 GB virtual machine file from here or here. E-mail me for login details.

Getting Started: VirtualBox and HistoryCrawler, by Mary Beth Start. Starting and stopping the VM, opening a terminal, and editing files with Atom.

Basic Text Analysis with Command Line Tools in Linux. Downloading a text file, standardizing it and determining word frequencies.
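
The post itself uses Linux command line tools. As a rough equivalent, the following Python sketch standardizes a text (lowercasing it and keeping only runs of letters) and counts word frequencies. The function name `word_frequencies` and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, keep runs of the letters a-z, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


sample = "The cat sat. The cat ran!"
freqs = word_frequencies(sample)
for word, count in freqs.most_common():
    print(word, count)
```

Restricting words to the letters a-z is a simplification; texts with accented characters or hyphenated words would need a more permissive pattern.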

Pattern Matching and Permuted Term Indexing with Command Line Tools in Linux. Searching through text files for patterns, using dictionaries for text mining, and building an index of keywords in context.
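
A keyword-in-context index lists each occurrence of a term together with the words around it. The sketch below is one simple way to build such a listing in Python; it is not the method used in the post, and the function name `kwic` and the `width` parameter are assumptions made for illustration.

```python
def kwic(text, keyword, width=2):
    """Return (left context, match, right context) for each keyword occurrence.

    Tokens are split on whitespace; surrounding punctuation is ignored
    when comparing a token with the keyword.
    """
    tokens = text.split()
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?\"'") == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "The old ship left port. Another ship arrived late."
for left, match, right in kwic(sample, "ship"):
    print(f"{left:>15} | {match} | {right}")
```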

Batch Downloading and Building Simple Search Engines with Command Line Tools in Linux. Downloading a number of text files automatically, and building a search engine for the collection of sources.
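
One common way to make a small collection searchable is an inverted index, which maps each word to the documents that contain it. The following Python sketch shows the idea under that assumption; the post may use a different approach, and the names `build_index` and `search` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


docs = {
    "letter1.txt": "The ship sailed from London",
    "letter2.txt": "A ship arrived in Halifax",
}
index = build_index(docs)
print(sorted(search(index, "ship")))
print(sorted(search(index, "ship london")))
```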

Named Entity Recognition with Command Line Tools in Linux. Using software to automatically extract and count terms that refer to persons, places or organizations.

Doing OCR Using Command Line Tools in Linux. Optical character recognition software converts pictures of printed or typed text into text files.

Working with PDFs Using Command Line Tools in Linux. Extracting pages, text and images from Adobe Acrobat PDFs, and making contact sheets with ImageMagick.

Working with Structured Data Using Command Line Tools in Linux. Manipulating CSV files with short Awk programs and other tools.
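
Short Awk programs are often used to total one column of a CSV file for each value in another. A comparable Python sketch is shown here; the function name `total_by_group`, the column names, and the sample data are all invented for illustration.

```python
import csv
import io
from collections import defaultdict


def total_by_group(csv_text, group_col, value_col):
    """Sum the integer values in value_col for each distinct value of group_col."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[group_col]] += int(row[value_col])
    return dict(totals)


sample = "place,year,letters\nLondon,1850,3\nParis,1850,2\nLondon,1851,4\n"
print(total_by_group(sample, "place", "letters"))
```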

Simple XML Parsing and Graph Visualization with Command Line Tools in Linux. Extracting information from XML files and visualizing it graphically with Graphviz.
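
Graphviz reads graph descriptions written in its DOT language, so one way to visualize XML data is to convert it to DOT text. The Python sketch below assumes a hypothetical XML layout in which each `link` element has `from` and `to` attributes; real files will be structured differently.

```python
import xml.etree.ElementTree as ET


def xml_to_dot(xml_text):
    """Turn <link from="..." to="..."/> elements into a Graphviz digraph."""
    root = ET.fromstring(xml_text)
    lines = ["digraph records {"]
    for link in root.iter("link"):
        source = link.get("from")
        target = link.get("to")
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


sample = '<records><link from="A" to="B"/><link from="B" to="C"/></records>'
print(xml_to_dot(sample))
```

Saving the printed text to a file and passing it to the Graphviz `dot` program would produce the drawing.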

Writing a Simple Web Spider Using Command Line Tools in Linux. Developing a Bash script to crawl and visualize an online database of linked records.
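
The post develops its spider as a Bash script. The core logic of a spider can also be sketched in Python as a breadth-first traversal that records which record links to which. In this sketch, an in-memory dictionary stands in for the online database so that it runs without a network connection; the names `crawl`, `get_links`, and `fake_site` are hypothetical.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first from start and return the (source, target) links found.

    get_links is any function that returns the links found on a page.
    """
    seen = {start}
    queue = deque([start])
    edges = []
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()
        visited += 1
        for target in get_links(page):
            edges.append((page, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return edges


# Stand-in for a real site: each key links to the records in its list.
fake_site = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for source, target in crawl("a", lambda page: fake_site.get(page, [])):
    print(source, "->", target)
```

A real spider would replace the lambda with a function that downloads a page and extracts its links, and should pause between requests to avoid overloading the server.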