History 9877A: Digital Research Methods (Fall 2014)

Historical research now crucially involves the acquisition and use of digital sources. In History 9877A, students learn to find, harvest, manage, excerpt, cluster and analyze digital materials throughout the research process, from initial exploratory forays through the production of an electronic article or monograph which is ready to submit for publication.

Course Description
Students and Auditors
Lesson 01: Linux Virtual Machines
Lesson 02: Basic Text Analysis
Lesson 03: Pattern Matching and Permuted Term Indexing
Lesson 04: Batch Downloading and Building Simple Search Engines
Lesson 05: Named Entity Recognition
Lesson 06: Optical Character Recognition
REVIEW SESSION (22 Oct 2014)
Lesson 07: Working with PDFs
Lesson 08: Structured Data
Lesson 09: XML Parsing and Graph Visualization
Lesson 10: Simple Web Spider
DEMONSTRATION SESSION (26 Nov 2014)

Course Description

Three Scenarios

Why should you take a course in digital research methods?

1. You’ve just returned from a whirlwind trip to the archives. On your laptop you have about nine thousand digital photographs of various documents. You could spend the next few years going through the pictures one at a time and typing notes into a word processor. Or you could write a small script to convert each image into readable text and drop the whole batch into a custom search engine. In less than an hour you could be searching for words and phrases anywhere in your primary sources.

2. You discover that the Internet Archive has a collection of eight hundred online texts that are directly related to your research. You could look through the list of titles in your web browser and click on the links one at a time, scanning each to see if it is relevant. Even if you cut-and-paste notes from the sources to a word processor, it will still take you at least a few months to go through the collection. Or you could write a small script to download all of the sources to your own machine and run a clustering program on them. This sorts the texts into folders of closely related documents, then subdivides those by topic. In less than an hour, you would be able to visualize the contents of the whole collection and focus in on the topics that are of immediate interest to you.

3. You’ve been working with the written corpus of a historically significant figure. You have the books and essays that he or she wrote, their diary entries and their correspondence with a large number of other individuals. How do you make sense of a lifetime of writing? Can you chart important changes in someone’s conceptual world? Spot the emergence of new ideas in the discourse of a community? Map the ever-changing social relations between a network of correspondents?

In this course you will learn to apply techniques that are currently used by fewer than one percent of working historians. Computation won’t magically do your research for you, but it will make you much more efficient. You can focus on close reading, interpretation and writing, and use machines to help you find, summarize, organize and visualize sources.

Prerequisites, Workload, Blogging and Evaluation

There are no prerequisites for the course other than a willingness to learn new things and the perseverance to keep working when you’re confused or when you realize that you could spend a lifetime learning about the topics and technologies that we will cover in class, and still not master them all. Students will come into the course with very different levels of experience and expertise. Some, probably most, will be familiar only with the rudiments of computer and internet use. A few may already be skilled programmers.

This course also requires that you spend at least a little bit of time each day (say 20-30 minutes) practicing your new skills. It’s a lot like learning a new language, learning to play a musical instrument or going to the gym. It is going to be hard at first, but be patient with yourself and ask a lot of questions. With daily practice, you will soon find ways to do your research and coursework faster and more efficiently. If you can’t commit to regular practice, however, you should probably not take this course. The techniques that you learn in this class build cumulatively week-by-week. In addition to regular practice, it is essential that you attend every meeting of the class and do the readings carefully.

Every student in the class will have an academic blog and will be required to make weekly posts to it. These entries do not have to be long (300-500 words per week is ample). The use of blogging is to encourage you to engage in ‘reflective practice,’ that is, to force you to think about your learning and research as you are doing it. It also provides me with feedback for how the course is going. You can use each week’s blog entry to talk about what you learned, things that were clear or not, things you would like to know how to do, and so on.

Before the first class you should go to either WordPress or Blogger (not both) and create an account and a blog. If possible, create the blog under your own name; if not, choose something professional sounding. Post an introductory message about yourself and then send me the URL of your blog so that I can add you to the course blogroll for History 9877A.

You will be graded on your participation in class (20%) and on your reflective blogging (80%). There will be no midterm or final examinations, and no final paper.

Text and Requirements

There is one required text for this course.

Shotts, William E., Jr. The Linux Command Line: A Complete Introduction. No Starch Press, 2012.

In addition, you will need a computer that you can use daily (ideally a laptop that you can bring to every class) and a USB flash drive (32 Gb or larger).

GitHub Repository

https://github.com/williamjturkel/Digital-Research-Methods

Students and Auditors

Lesson 01: Linux Virtual Machines

Overview
- An introduction to the course, the use of virtual machines, and the Linux operating system
Readings
- Shotts Ch 1, What is the Shell?
- (Optional after class) Virtual Box Manual, Ch. 1 First Steps
In-class activity
- Getting Started: VirtualBox and HistoryCrawler by Mary Beth Start
In-class discussion
- Operating systems
  - Windows and Mac (built on Unix)
  - Linux (descended from Unix)
  - Dual-boot machines
- Virtual machines
  - dedicated to your research
  - sits on top of a directory of your sources
  - build new or custom ones any time you need to
  - keep multiple versions and backups
  - run on any computer you have access to
  - synchronize over the cloud
  - scriptable
  - incorporate into extensive workflows
  - there is an exit strategy if you decide you want to return to Mac or Windows
Linux commands to study and practice
- clear, date and cal
- ls
- pwd and cd
- head, tail and less
- file
- cp and mv

Lesson 02: Basic Text Analysis

Overview
- We can get some idea of what a text is about by studying the frequency of words in it
Readings
- Shotts Ch 2, Navigation
- Shotts Ch 3, Exploring the System
In-class activity
- Basic Text Analysis with Command Line Tools in Linux
In-class discussion
- Taxonomy of digital sources
  - text (ASCII, Unicode)
  - markup (XML, HTML, TEI)
  - human-readable and machine-readable
  - open vs. closed / proprietary formats
  - documents (MS Word, PDF)
  - … and lots more we will discuss as course continues
- File naming conventions
  - Lowercase
  - Avoid blank spaces and punctuation except dot (.), dash (-) and underscore(_)
  - Specify dates like YYYYMMDD, e.g., 20130911
  - If you use numbers, left pad with zeros as appropriate. If you will have less than a hundred items, use one leading zero, e.g., 01, 02, …, 42, …, 99. For less than a thousand items, use two leading zeros, e.g., 001, 002, 042, …, 999.
Linux commands to study and practice
- wget
- wc
- tr
- redirection operators: <, >, |

Lesson 03: Pattern Matching and Permuted Term Indexing

Recap and Overview
- Word frequencies can give us some idea of what a text is about
- We can search for particular words or expressions in a text using regular expressions, a powerful pattern matching language
- We can see how a particular word is used by building a permuted index (also called a concordance or a keyword in context (KWIC) listing)
Readings
- Shotts Ch 4, Manipulating Files and Directories
- Shotts Ch 5, Working with Commands
- Shotts Ch 6, Redirection
In-class activity
- Create folder for week02 and clean up previous week’s work
- Pattern Matching and Permuted Term Indexing with Command Line Tools in Linux
In-class discussion
- Using wc to confirm that your file has the same characteristics as another one
- Fingerprinting documents, preventing bit rot, ensuring accurate communication: checksums, error correcting codes and cryptographic hashes
- The supposed plasticity of digital representation
- Regular expressions are very powerful, but they are not a good theoretical model for the structure of human natural languages
Linux commands to study and practice
- mkdir
- cp, mv and rm
- filename wildcards
- character classes (try using these with tr and egrep too)
- type, which, man, apropos, whatis
- wget
- cat, head, tail and less
- sort
- uniq
- ptx
- redirection operators: <, >, >>, |
- /dev/null

Lesson 04: Batch Downloading and Building Simple Search Engines

Recap and Overview
- We can download any online text and begin to make sense of it by analyzing word frequencies, searching for regular expressions and building a concordance
- We can download arbitrarily large collections of sources automatically
- Unlike our previous methods, a search engine can return sources ranked by relevance to a particular query
Readings
- Shotts Ch 7, Seeing the World as the Shell Sees It
- Shotts Ch 8, Advanced Keyboard Tricks
In-class activity
- Clean up files from previous activities with
  mv $(ls –ignore=week*) week03
- Batch Downloading and Building Simple Search Engines with Command Line Tools in Linux
In-class discussion
- rtfm
- Googling the error
- Useful sites: Linux Documentation Project, Debian Books, Stack Overflow
- How do we figure out whether a document is relevant to a query?
- Why burst long documents into little pieces before putting them in a search engine?
Linux commands to study and practice
- cat and echo
- brace expansion
- command substitution
- backslash escape sequences
- alias and unalias
- clear
- history
- tab completion
- wget
  - Milligan, Automated Downloading with wget from the Programming Historian
- split
- rename
- swish-e
  - Boolean Operators
  - Running Swish-e and Command Line Switches

Lesson 05: Named Entity Recognition

Recap and Overview
- We can automatically download arbitrarily large batches of files
- We have a variety of techniques for analyzing text and finding patterns in it: word frequencies, concordances, regular expressions, search engines
- A named entity recognizer is a program that goes through a text and tries to guess which words represent people, organizations, places or other kinds of entity
Readings
- Shotts Ch 19, Regular Expressions
- (Optional) Shotts Ch 12, A Gentle Introduction to vi
In-class activity
- Named Entity Recognition with Command Line Tools in Linux
In-class discussion
- Using probabilistic models for natural language
Linux commands to study and practice
- grep and variants
- (Optional) vi
- (Optional) vimtutor
- (Optional) Graphical vi/vim Cheatsheet and Tutorial

Lesson 06: Optical Character Recognition

Recap and Overview
- We have learned a variety of techniques for downloading texts, analyzing them and finding patterns in them
- We can extract the (printed or typescript) text in photographs or digital scans of documents using OCR (optical character recognition)
- Approximate regular expressions allow us to find terms that are close to a pattern, rather than matching it exactly
Readings
- Shotts Ch 17, Searching for Files
- Shotts Ch 18, Archiving and Backup
- Shotts Ch 20, Text Processing
In-class activity
- Doing OCR Using Command Line Tools in Linux
In-class discussion
- Taxonomy of digital sources, continued
  - born-digital vs. digitized
  - pictures of text (scanner, digital camera)
  - OCR: optical character recognition
- What kind of errors are introduced by OCR?
- How does approximate / fuzzy pattern matching work?
Linux commands to study and practice
- locate and find
- touch
- xargs
- gzip, gunzip, zcat and zless
- zip and unzip
- bzip2
- tar
- cut and paste
- join
- comm and diff
- convert, display and identify (ImageMagick)
  - Examples
- tesseract
- tre-agrep

REVIEW SESSION (22 Oct 2014)

Lesson 07: Working with PDFs

Recap and Overview
- We have a wide variety of tools for working with texts
- We can use optical character recognition (OCR) to extract printed or typeset text from digital images of documents
- We can extract text, images, page images and full pages from PDFs with command line tools
Readings
- xkcd, “Sandwich“
- Shotts Ch 9, Permissions
- Shotts Ch 10, Processes
In-class activity
- Working with PDFs Using Command Line Tools in Linux
In-class discussion
- How are PDFs structured?
Linux commands to study and practice
- chmod
- sudo
- chown and chgrp
- ps
- kill
- pdftotext and pdfimages
- pdftk
  - Examples
  - Man page
- montage (ImageMagick)
  - Examples

Lesson 08: Structured Data

Recap and Overview
- Linux has a wide variety of tools for manipulating and analyzing text files
- Text files can be used to store tabular data or database records that are partially or completely numeric
- The Awk programming language is standard in Linux and UNIX and is a good choice for manipulating structured data files
Readings
- Shotts Ch 23, Compiling Programs
- Philip Brown, “Awk Programming“, “Awk Programming Lesson 2“, “Awk Programming Lesson 3“
- Eric Wendelin, “Awk is a Beautiful Tool“
In-class activity
- Working with Structured Data Using Command Line Tools in Linux
In-class discussion
- What makes data ‘structured’? What are unique identifiers?
- Some strategies for learning how to program
Linux commands to study and practice
- cut, paste and join
- csvfix
  - Online Manual
- awk
  - Peteris Krumims, “Famous Awk One-Liners Explained, Part 1“, “Part 2“, “Part 3“
  - GNU Awk Manual

Lesson 09: XML Parsing and Graph Visualization

Recap and Overview
- One way to structure information inside of a human- and machine-readable text file is to store it in a table of rows and columns; not all information lends itself to this, however
- A different way to structure information in text files is to use a markup language like HTML or XML
- Computer programs can easily extract information from text files if it has been explicitly tagged
Readings
- David J. Birnbaum, “What is XML and Why Should Humanists Care?“
In-class activity
- Simple XML Parsing and Graph Visualization with Command Line Tools in Linux
In-class discussion
- What kinds of information can’t be represented in rows and columns?
- Is writing markup the same as programming?
Linux commands to study and practice
- xmlstarlet
  - online manual
- graphviz
  - about
  - documentation
  - gallery (you can click on a picture of a graph to see the code that generated it)

Lesson 10: Simple Web Spider

Recap and Overview
- Computer programs can extract information from text files if it has been explicitly tagged with XML or some other markup language
- A web spider (also known as a crawler or bot) is a program that downloads a web page and processes it, extracts links to other web pages, and follows each in turn, processing them
Readings
- Shotts Ch 24, Writing Your First Script
- Shotts Ch 25, Starting a Project
- Shotts Ch 26, Top-Down Design
- Shotts Ch 27, Flow Control: Branching with If
In-class activity
- Writing a Simple Web Spider Using Command Line Tools in Linux
In-class discussion
- Problem, algorithm and implementation
Linux commands to study and practice
- Review wget, xmlstarlet and graphviz
- Bash scripting: variables, for loops, if statements

DEMONSTRATION SESSION

Readings
- Milligan, Historians Love JSON, or One Quick Example of Why It Rocks
- Marti, Canadiana in Context
- (Optional) Healy, Using Metadata to Find Paul Revere
Learn More
- jq (command line JSON processor) Tutorials, Manual and Sandbox
- ProPublica, Using Google Refine to Clean Messy Data
- Small Data Journalism, Intro to Data Mashing and Mapping with Google Fusion Tables
- Martin Grandjean, Introduction to Network Visualization with Gephi

History 9877A: Digital Research Methods (Fall 2014)

Course Description

Students and Auditors

Lesson 01: Linux Virtual Machines

Lesson 02: Basic Text Analysis

Lesson 03: Pattern Matching and Permuted Term Indexing

Lesson 04: Batch Downloading and Building Simple Search Engines

Lesson 05: Named Entity Recognition

Lesson 06: Optical Character Recognition

REVIEW SESSION (22 Oct 2014)

Lesson 07: Working with PDFs

Lesson 08: Structured Data

Lesson 09: XML Parsing and Graph Visualization

Lesson 10: Simple Web Spider

DEMONSTRATION SESSION

Projects

Recent

Archives

Categories

History 9877A: Digital Research Methods (Fall 2014)

Course Description

Students and Auditors

Lesson 01: Linux Virtual Machines

Lesson 02: Basic Text Analysis

Lesson 03: Pattern Matching and Permuted Term Indexing

Lesson 04: Batch Downloading and Building Simple Search Engines

Lesson 05: Named Entity Recognition

Lesson 06: Optical Character Recognition

REVIEW SESSION (22 Oct 2014)

Lesson 07: Working with PDFs

Lesson 08: Structured Data

Lesson 09: XML Parsing and Graph Visualization

Lesson 10: Simple Web Spider

DEMONSTRATION SESSION

Share this:

Projects

Recent

Archives

Categories