Archives for the month of: February, 2011

The most important advantage of working with digital representations is that they can be computationally manipulated.  Nothing that happens on the internet would be possible if this were not the case.  Traditional analog sources, on the other hand, can really only be interpreted and used by people.  The upshot of this is that you will get the most benefit from your research if you make it a habit to digitize any source immediately if it is not already in digital form.

For my “super-secret” monograph project, I am working entirely with digital sources.  Since I am a historian, only a relatively small proportion of these were “born digital”.  These are sources that were created by or with digital devices.  Digital photographs and videos, e-mail messages, SMS texts, tweets, computer games, code and server logs are all examples of born digital sources.  They make up an increasing proportion of all of the information in the world.  The bulk of my sources actually began life as analog objects like books, letters, charts and film photographs, and were subsequently digitized by someone else.  In the 1980s and 90s, it seemed as if most traditional sources might never be digitized.  In the last seven years, however, Google alone has digitized more than a tenth of all of the books in the world.  At this point, my money is on the eventual digitization of absolutely everything.  So why not pitch in?

I still read physical papers and books, but instead of underlining them or writing notes on paper, I use an IRISPen to scan the quotes that I am going to want to access later.  The pen scanner is faster than typing notes, even if you are a fast typist.  If I think that the whole source will be useful (more on this in future posts) then I will use a standard desktop scanner to scan whole pages at a time.  If I’m in an archive, I use whatever combination I can of handheld computer, laptop computer, digital camera and flatbed scanner.  When I have research assistants, I ask them to scan sources for me rather than photocopy them.

For documents that are not born digital, the initial phase of digitization only creates digital pictures of text.  Optical character recognition (OCR) is the next crucial step.  I use the full version of Adobe Acrobat (Pro, not the free Reader) to add a text layer to any digital document that does not already have one.  This is useful not only for documents that you digitize yourself, but also for documents that you download from the internet.  Many sites use OCR on their documents to create a text layer for searching purposes, then strip out that layer from the PDFs that they make available for download.  You can use Acrobat Pro to run OCR on the downloaded document and add the text layer back in, and you will want to make it a habit to do that.

http://upload.wikimedia.org/wikipedia/commons/3/32/Terrakottaarmén.jpg

One advantage of working with digital resources is that they can be duplicated and stored more-or-less for free.  It is tempting to think of backup as a chore that can be deferred, but experience shows that if you try that you may end up putting it off until it is too late.  If you set up your backup system before doing anything else, you have more peace of mind, and you have more freedom to experiment with your process, because you can undo anything that isn’t working out.

For my ‘super-secret’ monograph, I am using a number of different kinds of backup and version control, and my system is intentionally redundant.  First of all, since I am working on a Mac, I have Time Machine set up to copy all of my files to an external hard drive at regular intervals.  If I decide that I need an earlier copy of something, it is usually easiest to check there first.  I also have a script that automatically makes a bootable copy of my machine every night on a different external hard drive using SuperDuper.  If my system dies, I want to be able to plug that drive into another Mac and boot up my working environment without any interruptions or any need to reinstall and customize software.  Occasionally I do want to reinstall software, so I also keep copies of all of the packages that I install in a separate directory (these are the .dmg files on the Mac).
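A redundant backup is only useful if the copies actually match the originals. As a minimal sketch (this is not part of the setup described above, and the function names and directory layout are hypothetical), a short Python script using only the standard library can compare a working folder against a backup copy by checksum and report anything missing or different:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_backup_problems(source_dir: Path, backup_dir: Path) -> Dict[str, List[str]]:
    """Compare each file under source_dir with its counterpart in backup_dir.

    Returns relative paths that are missing from the backup, and relative
    paths whose contents differ from the backup copy.
    """
    problems: Dict[str, List[str]] = {"missing": [], "different": []}
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file():
            problems["missing"].append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            problems["different"].append(rel.as_posix())
    return problems
```

Calling `find_backup_problems(Path("research"), Path("/Volumes/Backup/research"))` (both paths are placeholders) returns a dictionary with two lists.  Files that exist only in the backup are ignored here, on the assumption that a backup may legitimately keep things you have since deleted.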

So, if I lose a file, I’m covered.  If a software update breaks something that was working, I’m covered.  If my machine dies, I’m covered.  What happens if my house gets stepped on by Godzilla while I’m out for my daily walk?  (He goes for daily walks, too.)  That’s where offsite backups come in.  I’m using Jungle Disk to make daily copies of essential files in the cloud.  That system is automated so that I don’t have to think about it.  I am a pretty big fan of Jungle Disk… the entire NiCHE server got nuked once, and I was able to put a new copy of our site online in a few hours with no hassle.  I also use Dropbox regularly, to explicitly back up files, to synchronize things between my various computers, and to share document drafts and large data sets with collaborators in the UK.

Free and easy backup is just one advantage of an all-digital workflow; version control is another.  (If this is unfamiliar, see Julie Meloni’s “Gentle Introduction”.)  I’m using Git to version my source code and GitHub to share it with other programmers.  I tend not to revert to earlier versions of my own prose, so I don’t use version control for writing, but it is certainly an option.  If you are not programming and are mostly happy with what you write, you can use your backup system as a kind of version control.
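One lightweight way to treat backups as a kind of version control is to keep timestamped copies of a draft.  The following is only an illustrative sketch, not a description of any tool mentioned above; the `snapshot` function and the file names are hypothetical:

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot(draft: Path, archive_dir: Path, now: Optional[datetime] = None) -> Path:
    """Copy draft into archive_dir under a timestamped name and return the new path.

    The `now` parameter exists only so the timestamp can be fixed for testing;
    by default the current local time is used.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{draft.stem}-{stamp}{draft.suffix}"
    shutil.copy2(draft, target)  # copy2 also preserves modification times
    return target
```

Running `snapshot(Path("chapter1.txt"), Path("snapshots"))` at the end of each writing session leaves a dated trail of earlier versions.  Unlike Git, this keeps a full copy every time and records no commit messages, so it is better suited to prose you rarely revert than to code.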

"Stealth Bomber Kite", by [F]oxymoron

I didn’t have any plans to write a ‘super-secret’ monograph when I started my sabbatical.  Instead, I assumed that I would mostly be working on desktop fabrication and electronics.  One day I was looking through some old notes that I had taken and I had a crazy idea.  In itself that isn’t unusual, since one of my working principles is to take a note whenever I find a primary source that is weird, funny or salacious.  What was unusual was that I didn’t feel like sharing my crazy idea.  In part, I was afraid that if I told other people about it, they might pooh-pooh it before I had a chance to develop it.  My idea needed buttressing so it could “withstand the assaults of a hostile environment.”  (I think that I read Science in Action too carefully as a student.)

Since I’m working entirely with digital sources, I decided to put together a set of interacting programs–an ‘ecology’, if you will–to speed up the process of finding, harvesting, clustering, excerpting, and keeping track of my sources and my attempts to make sense of them.  The original collection of programs consisted of mostly off-the-shelf software tied together with a bit of Automator and Python scripting as necessary: no use in reinventing the wheel.  And what I discovered really surprised me.  Since the mid-1990s I’ve been going around telling people that the digitization of primary sources would change the way that we write history.  I took all of my notes during my PhD on a succession of handheld computers.  I even blogged about the future for three years in a resolutely optimistic mode.  What I hadn’t quite realized is that at some point I should have started using the past tense: that future has already happened.  If you want to work in a new way, everything that you need is ready to hand.  With all-digital sources and a set of tools for working with them, my research process was about an order of magnitude faster than it was on my last monograph.  Instead of asking myself, “Do I really want to spend seven years developing this idea into a book?” the question had become, “Is this idea worth ten months of my time?”  Changing the information and transaction costs meant that doing a crazy project was a lot less of a commitment.  It also meant that if I wanted to, I might write ten times as many books over the course of my lifetime as I had been planning to write.  I still find the process of writing to be difficult, however, so that may not happen.

So before I knew it I was about two years into writing another monograph, at least by the old way of measuring investment, and I still didn’t feel like telling anyone what I was doing.  There are pros and cons to talking about your topic while you’re writing.  The pro is that you get the benefit of other people’s good ideas.  That’s a con, too, because the community tends to shape your work toward a consensus that everyone can agree on.  After reading Infotopia I’m more wary about the wisdom of crowds, including (or especially) the wisdom of peers.  Plenty of time for peer review when I send the manuscript to a press and it goes out to readers.  By that time, I will have already answered most of the objections that I could think of, and the reviewers will help me to distill any excess crazy out of the final product.

I was kind of dreading encountering well-meaning colleagues while on sabbatical, because the first question that everyone asks is “What are you working on?”  Is it rude not to tell them?  So I started telling people the short version of the above story, and have found that colleagues are actually quite supportive and encouraging.  Some think that it is a great marketing gimmick or that I should include a decoder ring with the book.  Some tell me about the drawbacks they’ve encountered in telling other people about their work in progress.  Some like the idea of having complete freedom to change their minds while they’re working.  And instead of sharing the topic, I’ve made it a policy to share my method with anyone who is curious.  We end up having an interesting discussion about research methodology instead of a less interesting discussion about my current idée fixe.  Let’s face it: very few people are as interested in your topic as you are, but many of them are deeply engaged in research practice.

If you’re dying to know about the method you can e-mail me for details, but I will be posting more about it here soon.