
Recap

In a previous post, we used wget to download a Project Gutenberg ebook from the Internet Archive, then cleaned up the file using the sed and tr commands. The code below puts all of the commands we used into a pipeline.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
cat stmtn10.txt | tr -d '\r' | sed '2206,2525d' | sed '1,40d' > stmtn10-trimmedlf.txt

Using a pipeline of commands that we have already seen, we can also create a list of words in our ebook, one per line, sorted alphabetically. In English, the command below says “send the stmtn10-trimmedlf.txt file into a pipeline that translates uppercase characters into lowercase, translates hyphens into blank spaces, translates apostrophes into blank spaces, deletes all other punctuation, puts one word per line, sorts the words alphabetically, removes all duplicates and writes the resulting wordlist to a file called stmtn10-wordlist.txt”.

cat stmtn10-trimmedlf.txt | tr [:upper:] [:lower:] | tr '-' ' ' | tr "'" " " | tr -d [:punct:] | tr ' ' '\n' | sort | uniq > stmtn10-wordlist.txt

Note that we have to use double quotes around the tr expression that contains a single quote (i.e., apostrophe), so that the shell does not get confused about the arguments we are providing. Use less to explore stmtn10-wordlist.txt.

Dictionaries

Typically when you install Linux, at least one natural language dictionary is installed. Each dictionary is simply a text file that contains an alphabetical listing of ‘words’ in the language, one per line. The dictionaries are used for programs that do spell checking, but they are also a nice resource that can be used for text mining and other tasks. You will find them in the folder /usr/share/dict. I chose American English as my language when I installed Linux, so I have a dictionary called /usr/share/dict/american-english.
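
To see which dictionaries are installed on your own system, you can simply list that folder; the exact file names will vary with your distribution and the language packages you chose.

ls /usr/share/dict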

Suppose you are reading through a handwritten document and you come across a word that begins with an s, has two or three characters that you can’t make out, and ends in th.  You can use grep to search for the pattern in your dictionary to get some suggestions.

grep -E "^s.{2,3}th$" /usr/share/dict/american-english

The computer responds with the following.

saith
sheath
sixth
sleuth
sloth
smith
smooth
sooth
south
swath

In the statement above, the caret (^) and dollar sign ($) stand for the beginning and end of line respectively. Since each line in the dictionary file consists of a single word, we get words back. The dot (.) stands for a single character, and the pair of numbers in curly braces ({n,m}) says we are trying to match at least n characters and at most m.

Linux actually has a family of grep commands that correspond to commonly used options. There is a command called egrep, for example, which is equivalent to grep -E and matches an extended set of patterns. There is a command called fgrep, which is a fast way to search for fixed strings (rather than patterns). We will use both egrep and fgrep in examples below. As with any Linux command, you can learn more about the command line options with man.

Words in our text that aren’t in the dictionary

One way to use a dictionary for text mining is to get a list of words that appear in the text but are not listed in the dictionary. We can do this using the fgrep command, as shown below. In English, the command says “using the file /usr/share/dict/american-english as a source of strings to match (-f option), find all the words (-w option) in stmtn10-wordlist.txt that are not in the list of strings (-v option) and send the results to the file stmtn10-nondictwords.txt”.

fgrep -v -w -f /usr/share/dict/american-english stmtn10-wordlist.txt > stmtn10-nondictwords.txt

Use the less command to explore stmtn10-nondictwords.txt. Note that it contains years (1861, 1865), proper names (alice, andrews, charles), toponyms (america, boston, calcutta) and British / Canadian spellings (centre, fibres). Note that it also includes a lot of specialized vocabulary which gives us some sense of what this text may be about: coccinea (a kind of plant), coraltown, cornfield, ferny, flatheads, goshawk, hepaticas (another plant), pitchy, quercus (yet another plant), seaweeds, and so on. Two interesting ‘words’ in this list are cucuie and ea. Use grep on stmtn10-trimmedlf.txt to figure out what they are.
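
One way to start is to ask grep for the lines in which each of them appears, using the -w option so that ea is only matched as a whole word rather than as part of a longer one:

grep -n -w --color "cucuie" stmtn10-trimmedlf.txt
grep -n -w --color "ea" stmtn10-trimmedlf.txt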

Matching patterns within and across words

The grep command and its variants are useful for matching patterns both within a word and across a sequence of words. If we wanted to find all of the examples in our original text that contain an apostrophe s, we would use the command below. Note that the --color option colors the portion of the text that matches our pattern.

egrep --color "'s " stmtn10-trimmedlf.txt

If we wanted to find contractions, we could change the pattern to "'t ", and if we wanted to match both we would use "'. " (this would also match elided forms like discover’d).
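
Written out in full, those two variations look like this:

egrep --color "'t " stmtn10-trimmedlf.txt
egrep --color "'. " stmtn10-trimmedlf.txt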

We could search for the use of particular kinds of words. Which, for example, contain three vowels in a row?

egrep --color "[aeiou]{3}" stmtn10-trimmedlf.txt

We can also use egrep to search for particular kinds of phrases. For example, we could look for use of the first person pronoun in conjunction with English modal verbs.

egrep --color "I (can|could|dare|may|might|must|need|ought|shall|should|will|would)" stmtn10-trimmedlf.txt

Or we could see which pronouns are used with the word must:

egrep --color "(I|you|he|she|it|they) must" stmtn10-trimmedlf.txt

Spend some time using egrep to search for particular kinds of words and phrases. For example, how would you find regular past tense verbs (ones that end in -ed)? Years? Questions? Quotations?
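
Here are a few patterns you might use as starting points; each of them can certainly be refined (the year pattern, for example, will match any four-digit number):

egrep --color "[[:alpha:]]+ed " stmtn10-trimmedlf.txt   # regular past tense verbs
egrep --color "[0-9]{4}" stmtn10-trimmedlf.txt          # years
egrep --color "\?" stmtn10-trimmedlf.txt                # questions
egrep --color '"[^"]+"' stmtn10-trimmedlf.txt           # quotations within a single line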

Keywords in context

If you are interested in studying how particular words are used in a text, it is usually a good idea to build a concordance. At the Linux command line, this can be done easily using the ptx command, which builds a permuted term index. The command below uses the -f option to fold lowercase to uppercase for sorting purposes, and the -w option to set the width of our output to 50 characters.

ptx -f -w 50 stmtn10-trimmedlf.txt > stmtn10-ptx.txt

The output is stored in the file stmtn10-ptx.txt, which you can explore with less or search with grep.

If we want to find the word ‘giant’, for example, we might start with the following command. The -i option tells egrep to ignore case, so we get uppercase, lowercase and mixed case results.

egrep -i "[[:alpha:]]   giant" stmtn10-ptx.txt

Note that the word ‘giant’ occurs in many of the index entries. By preceding it with any alphabetic character, followed by three blank spaces, we see only those entries where ‘giant’ is the keyword in context. (Try grepping stmtn10-ptx.txt for the pattern “giant” to see what I mean.)
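
In other words, compare the output of the command above with a plain search for the same word:

egrep -i --color "giant" stmtn10-ptx.txt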

As a more detailed example, we might try grepping through our permuted term index to see if the author uses gendered pronouns differently.  Start by creating two files of pronouns in context.

egrep -i "[[:alpha:]]   (he|him|his) " stmtn10-ptx.txt > stmtn10-male.txt
egrep -i "[[:alpha:]]   (she|her|hers) " stmtn10-ptx.txt > stmtn10-female.txt

Now you can use wc -l to count the number of lines in each file, and less to page through them. We can also search both files together for interesting patterns.  If we type in the following command

cat *male* | egrep "   (he|she) .*ed"

we find “she died” and “she needs” versus “he toiled”, “he sighed”, “he flapped”, “he worked”, “he lifted”, “he dared”, “he lived”, “he pushed”, “he wanted” and “he packed”.
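
To get a rough sense of how often each set of pronouns appears overall, we can also count the lines in the two files, as suggested above:

wc -l stmtn10-male.txt stmtn10-female.txt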


Introduction

In the Linux and Unix operating systems, everything is treated as a file. Whenever possible, those files are stored as human- and machine-readable text files. As a result, Linux contains a large number of tools that are specialized for working with texts. Here we will use a few of these tools to explore a textual source.

Downloading a text

Our first task is to obtain a sample text to analyze. We will be working with a nineteenth-century book from the Internet Archive: Jane Andrews, The Stories Mother Nature Told Her Children (1888, 1894). Since this text is part of the Project Gutenberg collection, it was typed in by humans, rather than being scanned and OCRed by machine. This greatly reduces the number of textual errors we expect to find in it.  To download the file, we will use the wget command, which needs a URL. We don’t want to give the program the URL that we use to read the file in our browser, because if we do the file that we download will have HTML markup tags in it. Instead, we want the raw text file, which is located at


http://archive.org/download/thestoriesmother05792gut/stmtn10.txt

First we download the file with wget, then we use the ls command (list directory contents) to make sure that we have a local copy.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
ls

Our first view of the text

The Linux file command allows us to confirm that we have downloaded a text file. When we type

file stmtn10.txt

the computer responds with

stmtn10.txt: C source, ASCII text, with CRLF line terminators

The output of the file command confirms that this is an ASCII text (which we expect), guesses that it is some code in the C programming language (which is incorrect) and tells us that the ends of the lines in the file are coded with both a carriage return and a line feed. This is standard for Windows computers. Linux and OS X expect the ends of lines in an ASCII text file to be coded only with a line feed. If we want to move text files between operating systems, this is one thing we have to pay attention to. Later we will learn one method to convert the line endings from CRLF to LF, but for now we can leave the file as it is.

[UPDATE 2014. The file command no longer mistakenly identifies the file as C code.]

The head and tail commands show us the first few and last few lines of the file respectively.

head stmtn10.txt
The Project Gutenberg EBook of The Stories Mother Nature Told Her Children
by Jane Andrews

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.
tail stmtn10.txt

[Portions of this eBook's header and trailer may be reprinted only
when distributed free of all fees.  Copyright (C) 2001, 2002 by
Michael S. Hart.  Project Gutenberg is a TradeMark and may not be
used in any sales of Project Gutenberg eBooks or other materials be
they hardware or software or any other related product without
express permission.]

*END THE SMALL PRINT! FOR PUBLIC DOMAIN EBOOKS*Ver.02/11/02*END*

As we can see, the Project Gutenberg text includes some material in the header and footer which we will probably want to remove so we can analyze the source itself. Before modifying files, it is usually a good idea to make a copy of the original. We can do this with the cp command, then use the ls command to make sure we now have two copies of the file.

cp stmtn10.txt stmtn10-backup.txt
ls

In order to have a look at the whole file, we can use the less command. Once we run the following statement, we will be able to use the arrow keys to move up and down in the file one line at a time (or the j and k keys); the page up and page down keys to jump by pages (or the f and b keys); and the forward slash key to search for something (try typing /giant for example and then press the n key to see the next match). Press the q key to exit from viewing the file with less.

less -N stmtn10.txt

Trimming the header and footer

In the above case, we used the option -N to tell the less command that we wanted it to include line numbers at the beginning of each line. (Try running the less command without that option to see the difference.) Using the line numbers, we can see that the Project Gutenberg header runs from Line 1 to Line 40 inclusive, and that the footer runs from Line 2206 to Line 2525 inclusive. To create a copy of the text that has the header and footer removed, we can use the Linux stream editor sed. We have to start with the footer, because if we removed the header first it would change the line numbering for the rest of the file.

sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt

This command tells sed to delete all of the material between lines 2206 and 2525 and output the results to a file called stmtn10-nofooter.txt. You can use less to confirm that this new file still contains the Project Gutenberg header but not the footer. We can now trim the header from this file to create another version with no header or footer. We will call this file stmtn10-trimmed.txt. Use less to confirm that it looks the way it should. While you are using less to view a file, you can use the g key to jump to the top of the file and the shift-g to jump to the bottom.

sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt

Use the ls command to confirm that you now have four files: stmtn10-backup.txt, stmtn10-nofooter.txt, stmtn10-trimmed.txt and stmtn10.txt.

A few basic statistics

We can use the wc command to find out how many lines (-l option) and how many characters (-m) our file has. Running the following shows us that the answer is 2165 lines and 121038 characters.

wc -l stmtn10-trimmed.txt
wc -m stmtn10-trimmed.txt

Finding patterns

Linux has a very powerful pattern-matching command called grep, which we will use frequently. At its most basic, grep returns lines in a file which match a pattern. The command below shows us lines which contain the word giant. The -n option asks grep to include line numbers. Note that this pattern is case sensitive, and will not match Giant.

grep -n "giant" stmtn10-trimmed.txt
1115:Do you believe in giants? No, do you say? Well, listen to my story,
1138:to admit that to do it needed a giant's strength, and so they deserve
1214:giants think of doing. We have not long to wait before we shall see, and

What if we wanted to find both capitalized and lowercase versions of the word? In the following command, we tell grep that we want to use an extended set of possible patterns (the -E option) and show us line numbers (the -n option). The pattern itself says to match something that starts either with a capital G or a lowercase g, followed by lowercase iant.

grep -E -n "(G|g)iant" stmtn10-trimmed.txt

Creating a standardized version of the text

When we are analyzing the words in a text, it is usually convenient to create a standardized version that eliminates whitespace and punctuation and converts all characters to lowercase. We will use the tr command to translate and delete characters of our trimmed text, to create a standardized version. First we delete all punctuation, using the -d option and a special pattern which matches punctuation characters. Note that in this case the tr command requires that we use the redirection operators to specify both the input file (<) and the output file (>). You can use the less command to confirm that the punctuation has been removed.

tr -d [:punct:] < stmtn10-trimmed.txt > stmtn10-nopunct.txt

The next step is to use tr to convert all characters to lowercase. Once again, use the less command to confirm that the changes have been made.

tr [:upper:] [:lower:] < stmtn10-nopunct.txt > stmtn10-lowercase.txt

Finally, we will use the tr command to convert all of the Windows CRLF line endings to the LF line endings that characterize Linux and OS X files. If we don’t do this, the spurious carriage return characters will interfere with our frequency counts.

tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt

Counting word frequencies

The first step in counting word frequencies is to use the tr command to translate each blank space into an end-of-line character (or newline, represented by \n). This gives us a file where each word is on its own line. Confirm this using the less or head command on stmtn10-oneword.txt.

tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt

The next step is to sort that file so the words are in alphabetical order, and so that if a given word appears a number of times, these are listed one after another. Once again, use the less command to look at the resulting file. Note that there are many blank lines at the beginning of this file, but if you page down you start to see the words: a lot of copies of a, followed by one copy of abashed, one of ability, and so on.

sort stmtn10-oneword.txt > stmtn10-onewordsort.txt

Now we use the uniq command with the -c option to count the number of repetitions of each line. This will give us a file where the words are listed alphabetically, each preceded by its frequency. We use the head command to look at the first few lines of our word frequency file.

uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt
head stmtn10-wordfreq.txt
    358
      1 1861
      1 1865
      1 1888
      1 1894
    426 a
      1 abashed
      1 ability
      4 able
     44 about

Pipelines

When using the tr command, we saw that it is possible to tell a Linux command where it is getting its input from and where it is sending its output to. It is also possible to arrange commands in a pipeline so that the output of one stage feeds into the input of the next. To do this, we use the pipe operator (|). For example, we can create a pipeline to go from our lowercase file (with Linux LF endings) to word frequencies directly, as shown below. This way we don’t create a bunch of intermediate files if we don’t want to. You can use the less command to confirm that stmtn10-wordfreq.txt and stmtn10-wordfreq2.txt look the same.

tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt

When we use less to look at one of our word frequency files, we can search for a particular term with the forward slash. Trying /giant, for example, shows us that there are sixteen instances of the word giants in our text. Spend some time exploring the original text and the word frequency file with less.
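
Since the frequency file is sorted alphabetically, another useful trick is to re-sort it numerically so that the most common words appear first. The -r and -n options ask sort for a reverse numerical sort:

sort -rn stmtn10-wordfreq.txt | head -n 20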

Many digital humanists are probably aware that they could make their research activities faster and more efficient by working at the command line. Many are probably also sympathetic to arguments for open source, open content and open access. Nevertheless, switching to Linux full-time is a big commitment. Virtualization software, like Oracle’s free VirtualBox, allows one to create Linux machines that run inside a window on a Mac or PC. Since these virtual machines can be created from scratch whenever you need one, they make an ideal platform for learning command line techniques. They can also be customized for particular research tasks, as I will show in later posts.

In this post I show how to create a stripped-down Debian Linux virtual machine inside VirtualBox. It does not have a GUI desktop installed, so you have to interact with it through commands entered in a shell (you can add your own GUI later, if you’d like). The screenshots come from a Mac, but the install process should be basically the same for a Windows PC.

To get started, you need to download two things.  The first of these is a disk image file (ISO) for the version of Linux you want to install.  These files are different depending on the processor in your computer.  For a recent Windows or Mac desktop (i.e., a 64-bit Intel machine), the file that you probably want is debian-testing-amd64-CD-1.iso.  For older machines, you may need a different disk image.  Check the Debian distribution page for more details. The other thing that you need to download is the Oracle VirtualBox software for your operating system. Once you have downloaded VirtualBox, install it and then start it.

The image below shows the VirtualBox Manager running on my Mac. I have already created three other Linux virtual machines, but we can ignore these.

To create a new virtual machine, click the “New” button in the upper left hand corner of the Manager. Debian Linux comes in three standard flavours, known as “stable,” which is very solid but not very up-to-date, “testing,” which is pretty solid and reasonably up-to-date, and “unstable,” which is just that. The current code name for the testing version is “Wheezy”.  I like to name each of my virtual machines so I know what version of the operating system I am using.  I’m going to call this one “VBDebianWheezy64.”  You can call yours whatever you’d like.

Once you click “Continue,” the VirtualBox software will ask you a number of questions. For this installation we can use the default recommendations: a memory size of 384 megabytes of RAM, a virtual hard drive formatted as a VDI (VirtualBox Disk Image), dynamically allocated disk storage, and 8 gigabytes for the virtual machine.

Once we have set all of the options for the virtual machine, we are returned to the VirtualBox Manager.

Now we choose the virtual machine we just created and click the “Start” button in the Manager. The new machine starts with a message about how the mouse is handled when the cursor is over the virtual machine window.

Once you’ve read and accepted the message, the virtual machine will ask you for a start-up disk.

Click the file icon with the green up arrow on it, and you will be given a dialog that lets you choose the Debian ISO file you downloaded earlier.

The ISO file is now selected.

When you click “start” the Debian Install process will begin in the virtual machine window.

You can move around the installer options with the Up and Down arrows and Tab key. Use the Enter key to select an item. If there are options, you can usually turn them on or off with the Space bar. Here, press Enter to choose the “Install” option.

Next you want to select your language, location and preferred keyboard layout.

The installer will ask you for a hostname and a domain name. You can set the former to whatever you’d like; leave the latter blank unless you have a reason to set it.

Next, the installer will ask you for a root password. In Linux and Unix systems, the root account typically has the power to do everything, good and bad. Rather than setting a root password, we are going to leave the root password entry blank. The installer will respond by not creating a root account, but rather by giving the user account (i.e., you) sudo privileges.
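
In practice, this means that once the system is installed you will run administrative commands by prefixing them with sudo and entering your own password. For example, updating the package lists and installing a new package (wget is used here only as an example) looks like this:

sudo apt-get update
sudo apt-get install wget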

Now that the root account is disabled, you can enter your own name, username and password, and set the time zone.

The next set of screens asks you to specify how you would like the file system to be set up. As before, we will use the defaults. Later, when you are more familiar with creating virtual machines for specific tasks, you can tweak these as desired. We want guided partitioning, and we are going to use the entire virtual disk (this is the 8 GB dedicated to this particular virtual machine).

We only have one disk to partition, so we choose it.

We want all of our files in one partition for now.  Later, if you decide to do a lot of experimentation with Linux you may prefer to put your stuff in separate partitions when you create new virtual machines.

We can finish the partitioning and write the changes to disk.

Now the install process will ask us if we want to use other disk image files.  We do not.

We are going to grab install files from the Internet instead of from install disk images. (If you are working in a setting where downloads are expensive, you may not wish to do this.) We set up a network mirror to provide the install files.

Tell the installer what country you are in.

Then choose a Debian archive mirror. The default mirror is a good choice.

Now the installer will ask if we want to use a proxy server. Leave this blank unless you have a reason to change it.

I opt out of the popularity contest.

Debian gives you a lot of options for pre-installed bundles of software.  On a desktop, I choose only the “Standard system utilities.” If I am on a laptop, I also include the “Laptop” bundle. I leave all of the other ones unchecked.  (You can always install more software later.) The “Debian desktop environment” is the GUI, which is mouse-and-icon based, like Windows and OS X.  I have found it is much easier to get in the habit of using command line tools if you don’t bother with the GUI, at least at first.

The final step is to install the Grub bootloader.

Now the virtual machine will reboot when you click “Continue”.

This is the login prompt for your new Debian virtual machine.

You can use Linux commands to shut down the virtual machine if you would like.  You can also save it in such a way that it will resume where you left off when you reload it in VirtualBox. In the VirtualBox Manager, right click on the virtual machine and choose “Close”->“Save State”. That is shown in the next screenshot.
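
If you would rather shut the machine down from inside with a Linux command, a minimal option (assuming your account has the sudo privileges set up during the install) is:

sudo shutdown -h now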

You can save backups of your virtual machine whenever you reach a crucial point in your work, store VMs in the cloud, and share them with colleagues or students. You can also create different virtual machines for different tasks and use them to try out other Linux distributions. On my Macs, I also have Win XP and Win 7 VMs so I can run Windows-only software.

 

For a couple of years I have been working on outfitting the History Department at Western University with a new digital lab and classroom, funded by a very generous grant from our provost. The spaces are now open and mostly set up, and our graduate students and faculty have started to form working groups to teach themselves how to use the hardware and software and to share what they know with others. There is tremendous excitement about the potential of our lab, which is understandable. I believe that it is the best-equipped such space in the world: historians at Western now have their own complete Fab Lab.

In provisioning the lab and classroom, I wanted to strike a balance between supporting the kinds of activities that are typically undertaken in digital history and digital humanities projects right now, while also enabling our students and faculty to engage in the kind of “making in public” that many people argue will characterize the humanities and social sciences in the next decade.

Here is a high-level sketch of our facilities, organized by activity. The lab inventory actually runs to thousands of items, so this is just an overview.

To date, the facilities have been used most fully by Devon Elliott, a PhD student who is working with Rob MacDougall and me. Devon’s dissertation is on the technology and culture of stage magic. In his work, he designs electronics, programs computers, does 3D scanning, modeling and printing, builds illusions and installations, and leads workshops all over the place. You can learn more about his practice in a recent edition of the Canadian Journal of Communication and in the forthcoming #pastplay book edited by Kevin Kee. (Neither of these publications is open access yet, but you can email me for preprints.) Devon and I are also teaching a course on fabrication and physical computing at DHSI this summer with Jentery Sayers and Shaun Macpherson.

Past students in my interactive exhibit design course have also used the lab equipment to build dozens of projects, including a robot that plays a tabletop hockey game, a suitcase that tells the stories of immigrants, a batting helmet to immerse the user in baseball history, a device that lets the Prime Ministers on Canadian money tell you about themselves, a stuffed penguin in search of the South Pole, and many others. This year, students in the same class have begun to imagine drumming robots, print 3D replicas of museum artifacts, and make the things around them responsive to people.

In the long run, of course, the real measure of the space will be what kind of work comes out of it. While I don’t really subscribe to the motto “if you build it, they will come”, I do believe that historians who want to work with their hands as well as their heads have very few opportunities to do so. We welcome you! We’re very interested in taking student and post-doc makers and in collaborating with colleagues who are dying to build something tangible. Get excited and make things!

(If you have Mathematica you can download this as a notebook from my GitHub account. It is also available as a CDF document which can be read with Wolfram’s free CDF Player.)

Introduction

For a couple of years now I have been using Mathematica as my programming language of choice for my digital history work. For one thing, I love working with notebooks, which allow me to mix prose, citations, live data, executable code, manipulable simulations and other elements in a single document. I also love the generality of Mathematica. For any kind of technical work, there is usually a well-developed body of theory that is expressed in objects drawn from some branch of mathematics. Chances are, Mathematica already has a large number of high-level functions for working with those mathematical objects. The Mathematica documentation is excellent, if necessarily sprawling, since there are literally thousands of commands. The challenge is usually to find the commands that you need to solve a given problem. Since few Mathematica programmers seem to be working historians or humanists dealing with textual sources, it can be difficult to figure out where to begin.

Using a built-in text

As a sample text, we will use Darwin’s Origin of Species from Mathematica’s built-in example database. The Short command shows a small piece of something large. Here we’re asking to see about two lines of output, showing the beginning and the end of the text.

sample = ExampleData[{"Text", "OriginOfSpecies"}];
Short[sample, 2]

Mathematica responds with

INTRODUCTION. When on board H.M.S. ... have been, and are being, evolved.

The Head command tells us what something is. Our text is currently a string, an ordered sequence of characters.

Head[sample]
String

Extracting part of a string

Suppose we want to work with part of the text. We can extract the Introduction of Origin by pulling out everything between “INTRODUCTION” and “CHAPTER 1”. The command that we use to extract part of a string is called StringCases. Once we have extracted the Introduction, we want to check to make sure that the command worked the way we expected. Rather than look at the whole text right now, we can use the Short command to show us about five lines of the text. It returns a couple of phrases at the beginning and end, using ellipses to indicate the much larger portion which we are not seeing.

intro=StringCases[sample,Shortest["INTRODUCTION"~~__~~"CHAPTER"]][[1]];
Short[intro,5]
INTRODUCTION. When on board H.M.S. 'Beagle,' as naturalist, I was much struck with certain fac... nced that Natural Selection has been the main but not exclusive means of modification. CHAPTER

Note the use of the Shortest command in the string matching expression above. Since there are probably multiple copies of the word “CHAPTER” in the text, we have to tell Mathematica how much of the text we want to match… do we want the portion between “INTRODUCTION” and the first instance of the word, the second, the last? Here are two examples to consider:

StringCases["bananarama","b"~~__~~"a"]
{bananarama}
StringCases["bananarama",Shortest["b"~~__~~"a"]]
{bana}

From a string to a list of words

It will be easier for us to analyze the text if we turn it into a list of words. In order to eliminate punctuation, I am going to get rid of everything that is not a word character. Note that doing things this way turns the abbreviation “H.M.S.” into three separate words.

introList=StringSplit[intro,Except[WordCharacter]..];
Short[introList,4]
{INTRODUCTION,When,on,board,H,M,S,Beagle,as,<>,the,main,but,not,exclusive,means,of,modification,CHAPTER}

Mathematica has a number of commands for selecting elements from lists. The Take command allows us to extract a given number of items from the beginning of a list.

Take[introList,40]
{INTRODUCTION,When,on,board,H,M,S,Beagle,as,naturalist,I,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,South,America,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,
of,that}

The First command returns the first item in a list, and the Rest command returns everything but the first element. The Last command returns the last item.

First[introList]
INTRODUCTION
Short[Rest[introList]]
{When,on,<>,modification,CHAPTER}
Last[introList]
CHAPTER

We can also use an index to pull out list elements.

introList[[8]]
Beagle

We can test whether or not a given item is a member of a list with the MemberQ command.

MemberQ[introList, "naturalist"]
True
MemberQ[introList, "naturist"]
False

Processing each element in a list

If we want to apply some kind of function to every element in a list, the most natural way to accomplish this in Mathematica is with the Map command. Here we show three examples using the first 40 words of the Introduction. Note that Map returns a new list rather than altering the original one.

Map[ToUpperCase, Take[introList, 40]]
{INTRODUCTION,WHEN,ON,BOARD,H,M,S,BEAGLE,AS,NATURALIST,I,WAS,MUCH,STRUCK,
WITH,CERTAIN,FACTS,IN,THE,DISTRIBUTION,OF,THE,INHABITANTS,OF,SOUTH,AMERICA,
AND,IN,THE,GEOLOGICAL,RELATIONS,OF,THE,PRESENT,TO,THE,PAST,INHABITANTS,OF,
THAT}
Map[ToLowerCase, Take[introList, 40]]
{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,south,america,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,of,
that}
Map[StringLength, Take[introList, 40]]
{12,4,2,5,1,1,1,6,2,10,1,3,4,6,4,7,5,2,3,12,2,3,11,2,5,7,3,2,3,10,9,2,3,7,
2,3,4,11,2,4}

Computing word frequencies

In order to compute word frequencies, we first convert all words to lowercase, then sort them and count how often each appears using the Tally command. This gives us a list of lists, where each of the smaller lists contains a single word and its frequency.

lowerIntroList=Map[ToLowerCase,introList];
sortedIntroList=Sort[lowerIntroList];
wordFreq=Tally[sortedIntroList];
Short[wordFreq]
{{1837,1},{1844,2},<>,{years,3},{yet,2}}

Finally we can sort our tally list by the frequency of each item. This is traditionally done in descending order. In Mathematica we can change the sort order by passing the Sort command an anonymous function. (It isn’t crucial for this example to understand exactly how this works, but it is explained in the next section if you are curious. If not, just skip ahead.)

sortedFrequencyList = Sort[wordFreq, #1[[2]] > #2[[2]] &];
Short[sortedFrequencyList, 8]
{{the,100},{of,91},{to,54},{and,52},{i,44},{in,37},{that,27},{a,24},
{this,20},{it,20},{be,20},{which,18},<>,{admiration,1},{admirably,1},
{adduced,1},{adapted,1},{acquire,1},{acknowledging,1},{acknowledged,1},
{accuracy,1},{account,1},{absolutely,1},{1837,1}}

Here are the twenty most frequent words:

Take[sortedFrequencyList, 20]
{{"the", 100}, {"of", 91}, {"to", 54}, {"and", 52}, {"i", 44},
{"in", 37}, {"that", 27}, {"a", 24}, {"this", 20}, {"it", 20},
{"be", 20}, {"which", 18}, {"have", 18}, {"species", 17},
{"on", 17}, {"is", 17}, {"as", 17}, {"my", 13}, {"been", 13},
{"for", 11}}

The Cases statement pulls every item from a list that matches a pattern. Here we are looking to see how often the word “modification” appears.

Cases[wordFreq, {"modification", _}]
{{"modification", 4}}

Aside: Anonymous Functions

Most programming languages let you define new functions, and Mathematica is no exception. You can use these new functions with built-in commands like Map.

plus2[x_]:=
Return[x+2]

Map[plus2, {1, 2, 3}]
{3, 4, 5}

Being able to define functions allows you to

  • hide details: as long as you can use a function like plus2 you may not care how it works
  • reuse and share code: so you don’t have to keep reinventing the wheel.

In Mathematica, you can also create anonymous functions. One way of writing an anonymous function in Mathematica is to use a Slot in place of a variable.

# + 2 &

So we don’t have to define our function in advance, we can just write it where we need it.

Map[# + 2 &, {1, 2, 3}]
{3, 4, 5}

We can apply an anonymous function to an argument like this, and Mathematica will return the number 42.

(# + 2 &)[40]

A named function like plus2 is still sitting there when we’re done with it. An anonymous function disappears immediately after use.

N-grams

The Partition command can be used to create n-grams. This tells Mathematica to give us all of the partitions of a list that are two elements long and that are offset by one.

bigrams = Partition[lowerIntroList, 2, 1];
Short[bigrams, 8]
{{introduction,when},{when,on},{on,board},{board,h},{h,m},{m,s},{s,beagle},
{beagle,as},{as,naturalist},<>,{been,the},{the,main},{main,but},
{but,not},{not,exclusive},{exclusive,means},{means,of},{of,modification},
{modification,chapter}}

We can tally and sort bigrams, too.

sortedBigrams = Sort[Tally[bigrams], #1[[2]] > #2[[2]] &];
Short[sortedBigrams, 8]
{{{of,the},21},{{in,the},13},{{i,have},11},{{to,the},11},{{which,i},7},
{{to,me},7},{{of,species},6},{{i,shall},5},<>,{{beagle,as},1},
{{s,beagle},1},{{m,s},1},{{h,m},1},{{board,h},1},{{on,board},1},
{{when,on},1},{{introduction,when},1}}

Concordance (Keyword in Context)

A concordance shows keywords in the context of surrounding words. We can make one of these quite easily if we start by generating n-grams. Then we use Cases to pull out all of the 5-grams in the Introduction that have “organic” as the middle word (for example), and format the output with the TableForm command.

fivegrams=Partition[lowerIntroList,5,1];
TableForm[Cases[fivegrams,{_,_,"organic",_,_}]]
affinities of    organic beings on
several distinct organic beings by
coadaptations of organic beings to
amongst all      organic beings throughout
succession of    organic beings throughout

Removing stop words

Mathematica has access to a lot of built-in, curated data. Here we grab a list of English stopwords.

stopWords = WordData[All, "Stopwords"];
Short[stopWords, 4]
{0,1,2,3,4,5,6,7,8,9,a,A,about,above,<<234>>,with,within,without,would,x,X,y,Y,yet,you,your,yours,z,Z}

The Select command allows us to use a function to pull items from a list. We want everything that is not a member of the list of stop words.

Short[lowerIntroList, 8]
{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,<<1676>>,species,furthermore,i,am,convinced,
that,natural,selection,has,been,the,main,but,not,exclusive,means,of,
modification,chapter}
lowerIntroNoStopwords = 
  Select[lowerIntroList, Not[MemberQ[stopWords, #]] &];
Short[lowerIntroNoStopwords, 8]
{introduction,board,beagle,naturalist,struck,certain,facts,distribution,
inhabitants,south,america,geological,relations,present<<697>>,species,
descendants,species,furthermore,am,convinced,natural,selection,main,
exclusive,means,modification,chapter}

Bigrams containing the most frequent words

Here is a more complicated example built mostly from functions we’ve already seen. We start by finding the ten most frequently occurring words once we have gotten rid of stop words.

freqWordCounts = 
 Take[Sort[
   Tally[Take[
     lowerIntroNoStopwords, {1, -120}]], #1[[2]] > #2[[2]] &], 10]
{{shall,9},{species,9},{facts,9},{chapter,5},{variation,5},
{conditions,5},{beings,5},{organic,5},{conclusions,5},{subject,5}}

We remove a few of the words we are not interested in, then we rewrite the bigrams as a list of graph edges. This will be useful for visualizing the results as a network.

freqWords = 
  Complement[Map[First, freqWordCounts], {"shall", "subject"}];
edgeList = 
  Map[#[[1]] -> #[[2]] &, Partition[lowerIntroNoStopwords, 2, 1]];
Short[edgeList, 4]
{introduction->board,board->beagle,beagle->naturalist,<<717>>,
exclusive->means,means->modification,modification->chapter}

We grab the most frequent ones.

freqBigrams = Union[Select[edgeList, MemberQ[freqWords, #[[1]]] &],
   Select[edgeList, MemberQ[freqWords, #[[2]]] &]];
Short[freqBigrams, 4]
{abstract->variation,affinities->organic,allied->species,<<87>>,varieties->species,varying->conditions,volume->facts}

Finally we can visualize the results as a network. When you are exploring a text this way, you often want to keep tweaking your parameters and see if anything interesting comes up.

Framed[Pane[
  GraphPlot[freqBigrams, 
   Method -> {"SpringElectricalEmbedding", 
     "InferentialDistance" -> .1, "RepulsiveForcePower" -> -4}, 
   VertexLabeling -> True, DirectedEdges -> True, 
   ImageSize -> {1100, 800}], {400, 400}, Scrollbars -> True, 
  ScrollPosition -> {400, 200}]]

Document frequencies

We have been looking at the Introduction to Origin. We can also calculate word frequencies for the whole document. When we list the fifty most common words (not including stop words) we can get a better sense of what the whole book is about.

sampleList = 
  Map[ToLowerCase, StringSplit[sample, Except[WordCharacter] ..]];
docFreq = Sort[Tally[Sort[sampleList]], #1[[2]] > #2[[2]] &];
Take[Select[Take[docFreq, 200], 
  Not[MemberQ[stopWords, First[#]]] &], 50]
{{species,1489},{forms,397},{varieties,396},{selection,383},
{natural,361},{life,298},{plants,297},{different,282},{case,281},
{animals,280},{great,260},{distinct,255},{nature,253},{having,252},
{new,244},{long,243},{period,238},{cases,224},{believe,216},
{structure,214},{conditions,211},{genera,210},{generally,199},
{number,198},{common,194},{far,193},{time,191},{degree,190},
{groups,173},{characters,170},{certain,169},{view,168},{large,168},
{instance,165},{modification,161},{facts,157},{closely,155},
{parts,154},{intermediate,154},{modified,153},{genus,147},
{present,143},{birds,143},{produced,141},{individuals,140},
{inhabitants,139},{parent,138},{world,136},{character,136},
{organic,135}}

TF-IDF: Term frequency-Inverse document frequency

The basic intuition behind tf-idf is as follows…

  • A word that occurs frequently on every page doesn’t tell you anything special about that page. It is a stop word.
  • A word that occurs only a few times in the whole document or corpus can be ignored.
  • A word that occurs a number of times on one page but is relatively rare in the document or corpus overall can give you some idea what the page is about.

Here is one way to calculate tf-idf (there are lots of different versions):

tfidf[termfreq_,docfreq_,numdocs_]:=
     Log[termfreq+1.0] Log[numdocs/docfreq]

Using document frequencies and TF-IDF we can get a sense of what different parts of a text are about. Here is how we would analyze chapter 9 (there are 15 chapters in all).

ch9 = StringCases[sample, Shortest["CHAPTER 9" ~~ __ ~~ "CHAPTER"]][[
   1]];
ch9List = Map[ToLowerCase, StringSplit[ch9, Except[WordCharacter] ..]];
ch9Terms = Union[ch9List];
ch9TermFreq = Sort[Tally[ch9List], #1[[2]] > #2[[2]] &];
ch9DocFreq = Select[docFreq, MemberQ[ch9Terms, #[[1]]] &];
computeTFIDF[termlist_, tflist_, dflist_] :=
 Module[{outlist, tf, df},
  outlist = {};
  Do[
   tf = Cases[tflist, {t, x_} -> x][[1]];
   df = Cases[dflist, {t, x_} -> x][[1]];
   outlist = Append[outlist, {t, tf, df, tfidf[tf, df, 15.0]}],
   {t, termlist}];
  Return[outlist]]
ch9TFIDF = 
  Sort[computeTFIDF[ch9Terms, ch9TermFreq, 
    ch9DocFreq], #1[[4]] > #2[[4]] &];
Take[ch9TFIDF, 50][[All, 1]]
{teleostean,tapir,richest,pebbles,mississippi,downs,decay,conchologists,
wear,thinner,tear,supplement,superimposed,sedgwick,rolled,poorness,nodules,
mineralogical,levels,inadequate,grinding,gravel,downward,denuded,comprehend,
chthamalus,atom,accumulations,sand,ramsay,littoral,sedimentary,wears,wearing,
wealden,watermark,watch,vehemently,valve,upright,unimproved,unfathomable,
undermined,underlies,unanimously,ubiquitous,transmutation,tides,tidal,swarmed}

Whether or not you are familiar with nineteenth-century science, it should be clear that the chapter has something to do with geology. Darwin also provided chapter summaries of his own:

StringTake[ch9, 548]
CHAPTER 9. ON THE IMPERFECTION OF THE GEOLOGICAL RECORD. On the absence 
of intermediate varieties at the present day. On the nature of extinct 
intermediate varieties; on their number. On the vast lapse of time, as 
inferred from the rate of deposition and of denudation. On the poorness 
of our palaeontological collections. On the intermittence of geological 
formations. On the absence of intermediate varieties in any one formation. 
On the sudden appearance of groups of species. On their sudden appearance 
in the lowest known fossiliferous strata.

In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.

My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurrence function.) As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising I decided to share it so that other people might get some use out of it (or suggest improvements).

In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.
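
The comparison itself only needs file sizes. As a rough illustration (this is a sketch, not the code used to produce the graph below), suppose each page n of a book were stored as a page image n.jpg alongside an ALTO file n.xml in the current directory; a small shell loop could then print one line per page for plotting elsewhere:

# print "page jpeg_bytes alto_bytes" for every page that has both files
for jpg in *.jpg; do
    page="${jpg%.jpg}"
    alto="$page.xml"
    [ -f "$alto" ] || continue
    printf '%s %s %s\n' "$page" "$(wc -c < "$jpg")" "$(wc -c < "$alto")"
done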

The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster, pages with images tend to cluster, and blank pages (and covers) fall out along one axis because they have no text at all. We can use more sophisticated image processing and machine learning to further subdivide images and extract them once they are located, but this seems pretty good for a first pass.

In my previous post, I showed how to connect an Arduino microcontroller to Mathematica on Mac OS X using the SerialIO package.  It is also quite straightforward to interact with Phidgets.  In this case we can take advantage of Mathematica’s J/Link Java interface to call the Phidgets API.  This is basically a ‘hello world’ demonstration.  For a real application you would include error handling, event driven routines, and so on.  For more details, read the Java getting started tutorial and Phidgets programming manual, then look at the sample code and javadocs on this page.

Start by installing the Mac OS X Phidgets driver on your system. Once you have run Phidgets.mpkg you can open System Preferences and there will be a pane for Phidgets.  For my test, I used a PhidgetInterfaceKit 8/8/8 with an LED on Output 2 and a 60mm slider (potentiometer) attached to Sensor 0. Once you have the hardware configuration you like, plug the InterfaceKit into the USB. It should show up in the General tab of system preferences. If you double click on the entry, it will start a demonstration program that allows you to make sure you can toggle the LED and get values back from the slider. When everything is working correctly, you can close the program and open Mathematica.

In a Mathematica notebook, you are going to load the J/Link package, install Java, and put the phidget21.jar file on your class path by editing the AddToClassPath[] command in the snippet below.

Needs["JLink`"]
InstallJava[]
AddToClassPath["/path/to/phidget21"]
phidgetsClass = LoadJavaClass["com.phidgets.InterfaceKitPhidget"]

Next, create a new instance of the InterfaceKit object, open it and wait for attachment. You can include a timeout value if you’d like. Once the InterfaceKit is attached, you can query it for basic information like device name, serial number and sensor and IO options.

ik = JavaNew[phidgetsClass]
ik@openAny[]
ik@waitForAttachment[]
ik@getSerialNumber[]
ik@getDeviceName[]
{ik@getOutputCount[], ik@getInputCount[], ik@getSensorCount[]}

Finally you can use Mathematica‘s Dynamic[] functionality to create a virtual slider in the notebook that will waggle back and forth as you move the physical slider attached to the InterfaceKit. You can also turn the LED on and off by clicking a dynamic checkbox in the notebook.

Slider[
 Dynamic[
  Refresh[ik@getSensorValue[0], UpdateInterval -> 0.1]], {0, 1000}]

bool=false;
Dynamic[ik@setOutputState[2, bool]]
Checkbox[Dynamic[bool]]

When you are finished experimenting, close the InterfaceKit object.

ik@close[]

I’ve been programming regularly in Mathematica for more than a year, using the language mostly for spidering, text mining and machine learning applications. But now that I am teaching my interactive exhibit design course again, I’ve started thinking about using Mathematica for physical computing and desktop fabrication tasks. First on my to do list was to find a way to send and receive data from the Arduino. A quick web search turned up the work of Keshav Saharia, who is close to releasing a package called ArduinoLink that will make this easy. In the meantime, Keshav helped me to debug a simple demonstration that uses the SerialIO package created by Rob Raguet-Schofield. There were a few hidden gotchas involved in getting this working on Mac OS X, so I thought I would share the process with others who may be interested in doing something similar.

On the Arduino side, I attached a potentiometer to Analog 1, and then wrote a simple program that waits for a signal from the computer, reads the sensor and then sends the value back on the serial port.  It is based on the Serial Call and Response tutorial on the Arduino website.

/*
 arduino_mathematica_example

 This code is adapted from

http://arduino.cc/en/Tutorial/SerialCallResponse

 When started, the Arduino sends an ASCII A on the serial port until
 it receives a signal from the computer. It then reads Analog 1,
 sends a single byte on the serial port and waits for another signal
 from the computer.

 Test it with a potentiometer on A1.
 */

int sensor = 0;
int inByte = 0;

void setup() {
  Serial.begin(9600);
  establishContact();
}

void loop() {
  if (Serial.available() > 0) {
    inByte = Serial.read();
    // divide sensor value by 4 to return a single byte 0-255
    sensor = analogRead(A1)/4;
    delay(15);
    Serial.write(sensor);
  }
}

void establishContact() {
  while (Serial.available() <= 0) {
    Serial.print('A');
    delay(100);
  }
}

Once the sketch is installed on the Arduino, close the Arduino IDE (otherwise the device will look busy when you try to interact with it from Mathematica).  On the computer side, you have to install the SerialIO package in

/Users/username/Library/Mathematica/Applications

and make sure that it is in your path.  If the following command does not evaluate to True

MemberQ[$Path, "/Users/username/Library/Mathematica/Applications"]

then you need to run this command

AppendTo[$Path, "/Users/username/Library/Mathematica/Applications"]

Next, edit the file

/Users/username/Library/Mathematica/Applications/SerialIO/Kernel/init.m

so the line

$Link = Install["SerialIO"]

reads

$Link =
Install["/Users/username/Library/Mathematica/Applications/SerialIO/MacOSX/SerialIO",
LinkProtocol -> "Pipes"]

If you need to find the port name for your Arduino, you can open a terminal and type

ls /dev/tty.*

The demonstration program is shown below.  You can download both the Arduino / Wiring sketch and the Mathematica notebook from my GitHub repository.  You need to change the name of the serial device to whatever it is on your own machine.

<<SerialIO`

myArduino = SerialOpen["/dev/tty.usbmodem3a21"]

SerialSetOptions[myArduino, "BaudRate" -> 9600]

SerialReadyQ[myArduino]

Slider[
 Dynamic[Refresh[SerialWrite[myArduino, "B"];
  First[SerialRead[myArduino] // ToCharacterCode],
  UpdateInterval -> 0.1]], {0, 255}]

The Mathematica code loads the SerialIO package, sets the rate of the serial connection to 9600 baud to match the Arduino, and then polls the Arduino ten times per second to get the state of the potentiometer.  It doesn’t matter what character we send the Arduino (here we use an ASCII B).  We need to use ToCharacterCode[] to convert the response to an integer between 0 and 255.  If everything worked correctly, you should see the slider wiggle back and forth in Mathematica as you turn the potentiometer.  When you are finished experimenting, you need to close the serial link to the Arduino with

SerialClose[myArduino]

For a number of years I’ve taught a studio course in our public history graduate program on designing interactive exhibits. Most academic historians present their work in monographs and journal articles unless they are way out there on the fringe, in which case they may be experimenting with trade publications, documentary film, graphic novels, photography, websites, blogs, games or even more outré genres. Typically the emphasis remains on creating representations that are intended to be read in some sense, ideally very carefully. Public historians, however, need to be able to communicate to larger and more disparate audiences, in a wider variety of venues, and in settings where they may not have all, or even much, of the attention of their publics. Exhibits that are designed merely to be read closely are liable to be mostly ignored. When that happens, of course, it doesn’t matter how interesting your interpretation is.

Students in the course learn how to embed their interpretations in interactive, ambient and tangible forms that can be recreated in many different settings. To give some idea of the potential, consider the difference between writing with a word processor and stepping on the brake of a moving car. While using a word processor you are focused on the task and aware that you are interacting with a computer. The interface is intricate, sensorimotor involvement is mostly limited to looking and typing, and your surrounding environment recedes into the background of awareness. On the other hand, when braking you are focused on your involvement with the environment. Sensorimotor experiences are immersive, the interface to the car is as simple as possible, and you are not aware that you are interacting with computers (although recent-model cars in fact have dozens of continuously operating and networked microcontrollers).

Academic historians have tended to emphasize opportunities for knowledge dissemination that require our audience to be passive, focused and isolated from one another and from their surroundings. When we engage with a broader public, we need to supplement that model by building some of our research findings into communicative devices that are transparently easy to use, provide ambient feedback, and are closely coupled with the surrounding environment. The skills required to do this come from a number of research fields that ultimately depend on electronics and computers. Thanks to the efforts of community-minded makers, hackers, and researchers, these techniques are relatively easy to learn and apply.

Physical computing. In order to make objects or environments aware of people, to make them responsive and interactive, we need to give them a better sense of what human beings are like and what they’re capable of (Igoe & O’Sullivan 2004; Igoe 2011). Suppose your desktop computer had to guess what you look like based on your use of a word processor. It could assume that you have an eye and an ear (because you respond to things presented on the screen and to beeps), and it could assume you have a finger (because you push keys on the keyboard). To dramatize this, I usually use the image above, which is based on a drawing in Igoe and O’Sullivan (2004). It looks horrible: people are nothing like that. By giving our devices a better sense of what we’re actually like, we make it possible for them to better fit into our ongoing lifeworlds.

Pervasive computing. We are at the point where computational devices are becoming ubiquitous, invisible, part of the surroundings (McCullough 2004). The design theorist Adam Greenfield refers to this condition as “everyware” (2006). A number of technologies work together to make this possible. Embedded microprocessors put the power of full computers into tiny packages. Micro-electro-mechanical systems (MEMS) provide tiny sensors and actuators for measuring and controlling the environment. Radio transceivers allow these miniature devices to communicate with one another and get online. Passive radio frequency ID circuits (RFIDs) are powered by radio waves to transmit identifying information. All of these systems are mass-produced so that unit costs are very low, and it becomes possible to imagine practically everything being manufactured with its own unique identifier and web address. This scenario is sometimes called the “internet of things.” Someday, instead of searching for your keys, you may be able to Google for them. As Bruce Sterling notes, practically everything in the world could become the “protagonist of a documented process” (2005). Provenance has typically had to be reconstructed painstakingly for a tiny handful of objects. Most historians are not ready to conduct research in a world where every object can tell us about its own history of manufacture, ownership, use, repair, and so on. Dealing with pervasive computation will require the ability to quickly focus on essential information, to relegate non-essential information to peripheral awareness, and to access information in the places and settings where it can make a difference.

Interaction Design. The insinuation of computation and interactivity into every conceivable setting has forced designers to abandon the traditional idea of “human-computer interaction,” and to take a much more expansive perspective instead (Moggridge 2006; Saffer 2006). Not only is everything becoming a potential interface, but many smart devices are better conceptualized as mediating between people, rather than between person and machine. Services like ordering a cup of coffee at Starbucks are now designed using the same techniques as those used to create interactive software (e.g., Google calendar) and hardware (e.g., the iPod). In order to benefit from the lessons of interaction design, historians will have to take into account the wide range of new settings where we can design experiences and shape historical consciousness. The technology of tangible computing provides a link between pervasive devices, social interaction, and the material environment (Dourish 2004).

Desktop Fabrication. Most radical of all, everything that is in digital form can be materialized via machines that add or subtract matter. The former include a range of 3D printing technologies that deposit tiny amounts of glue, plastic or other materials, or that use lasers to selectively fuse small particles of metal, ceramic or plastic. The latter include computer-controlled milling machines, lathes, drills, grinders, laser cutters and other tools. The cost of these devices has been dropping rapidly, while their ease of use increases. The physicist Neil Gershenfeld has assembled a number of “fab labs” (universal fabrication laboratories) from collections of these devices. At present, a complete fab lab costs around $30,000 to $40,000, and a few key machines are considerably cheaper (Gershenfeld 2000, 2007). Enthusiasts talk about the possibility of downloading open source plans and “printing out” a bicycle, an electric guitar, anything really. An open source hardware community is blossoming, aided in part by O’Reilly Media’s popular MAKE magazine and by websites like Instructables and Thingiverse. Desktop fabrication makes it possible to build and share custom interactive devices that communicate our knowledge in novel, material forms.

References

  • Dourish, Paul. Where the Action Is: The Foundations of Embodied Interaction. Cambridge, MA: MIT, 2004.
  • Gershenfeld, Neil. When Things Start to Think. New York: Holt, 2000.
  • Gershenfeld, Neil. Fab: The Coming Revolution on Your Desktop—From Personal Computers to Personal Fabrication. New York: Basic, 2007.
  • Greenfield, Adam. Everyware: The Dawning Age of Ubiquitous Computing. Berkeley, CA: New Riders, 2006.
  • Igoe, Tom. Making Things Talk, 2nd ed. Sebastopol, CA: O’Reilly, 2011.
  • Igoe, Tom and Dan O’Sullivan. Physical Computing: Sensing and Controlling the Physical World with Computers. Thomson Course Technology, 2004.
  • McCullough, Malcolm. Digital Ground: Architecture, Pervasive Computing and Environmental Knowing. Cambridge, MA: MIT, 2004.
  • Moggridge, Bill. Designing Interactions. Cambridge, MA: MIT, 2006.
  • Norretranders, Tor. The User Illusion: Cutting Consciousness Down to Size. New York: Penguin, 1999.
  • Saffer, Dan. Designing for Interaction: Creating Smart Applications and Clever Devices. Berkeley, CA: New Riders, 2006.
  • Sterling, Bruce. Shaping Things. Cambridge, MA: MIT, 2005.
  • Torrone, Phillip. “Open Source Hardware, What Is It? Here’s a Start…” MAKE: Blog (23 Apr 2007).

When Alan MacEachern and I started working on the first edition of The Programming Historian in late 2007, our goal was to create an online resource that could be used by historians and other humanists to teach themselves a little bit of programming. Many introductory texts and websites approach programming languages in a systematic (if dull) way, starting with basics such as data types and gradually introducing various language constructs. This is fine if you already know how to program. Most beginners, however, are more concerned with addressing a practical need than they are with learning technical details that don’t seem to be immediately relevant. We wanted to approach programming as a means of expression. There is plenty of time to begin learning grammar after you’ve had a few conversations, as it were.

Neither of us expected PH to become nearly as popular as it did. We’re still young (OK, technically we’re middle aged) but we’ll have to work pretty hard to ever gain as many readers for anything else we write. While gratifying, the success of the first edition raised new problems. Some people wanted to pitch in. Some wanted help with particular problems. Some wanted to translate the material into other natural languages or other programming languages. In the meantime, websites changed, operating systems changed, software libraries changed, programming languages changed. Change is good! But dealing with change is difficult if one or two people try to do everything by themselves. Fortunately, there is a better way.

For a couple of years now, Adam Crymble has been working with us on creating a new edition of The Programming Historian that will be open to user contributions. There will be a number of ways for people to get involved: as writers, programmers, editors, technical reviewers, testers, website hackers, graphic designers, discussants, translators, and so on.  All contributions will be peer-reviewed, and everyone who participates will get credit for his or her work.  One of our aims with the first edition was to maintain a narrative thread that led the user through a series of useful projects. The new edition is organized in terms of short lessons that build on knowledge acquired in previous ones. Informally, you might think of this as “choose your own adventure.” Technically the new site will be structured like a directed acyclic graph, with tools that make it easy to keep track of what you’ve learned so far and provide you with a number of choices going forward. All source code will be under version control, making it easy to maintain and fork.

Over the next few months we will be inviting beta contributors to help us design and develop the website, write and program new lessons, do editing and peer reviewing, and generally turn the goodness up to 11.  There will be new lessons that lead into subjects like visualization, geospatial data, image search, integration with external tools, and the use of APIs. We will also be working with new institutional partners and exploring the connections to be made with other, similar projects. If you would like to get involved, please don’t hesitate to e-mail me and let me know. We will do a public launch when everything is working smoothly and we are ready to accept general contributions, hopefully sometime in 2012.
