Basic Text Analysis in Mathematica

2012/11/19

(If you have Mathematica you can download this as a notebook from my GitHub account. It is also available as a CDF document which can be read with Wolfram’s free CDF Player.)

Introduction

For a couple of years now I have been using Mathematica as my programming language of choice for my digital history work. For one thing, I love working with notebooks, which allow me to mix prose, citations, live data, executable code, manipulable simulations and other elements in a single document. I also love the generality of Mathematica. For any kind of technical work, there is usually a well-developed body of theory that is expressed in objects drawn from some branch of mathematics. Chances are, Mathematica already has a large number of high-level functions for working with those mathematical objects. The Mathematica documentation is excellent, if necessarily sprawling, since there are literally thousands of commands. The challenge is usually to find the commands that you need to solve a given problem. Since few Mathematica programmers seem to be working historians or humanists dealing with textual sources, it can be difficult to figure out where to begin.

Using a built-in text

As a sample text, we will use Darwin’s Origin of Species from Mathematica’s built-in example database. The Short command shows a small piece of something large. Here we’re asking for about two lines of output, which show the beginning and end of the text.

sample = ExampleData[{"Text", "OriginOfSpecies"}];
Short[sample, 2]

Mathematica responds with

INTRODUCTION. When on board H.M.S. ... have been, and are being, evolved.

The Head command tells us what something is. Our text is currently a string, an ordered sequence of characters.

Head[sample]

String

Extracting part of a string

Suppose we want to work with part of the text. We can extract the Introduction of Origin by pulling out everything between “INTRODUCTION” and the first occurrence of “CHAPTER” that follows it. The command that we use to extract part of a string is called StringCases. Once we have extracted the Introduction, we want to check that the command worked the way we expected. Rather than look at the whole text right now, we can use the Short command to show us about five lines of the text. It returns a couple of phrases at the beginning and end, using ellipses to indicate the much larger portion which we are not seeing.

intro=StringCases[sample,Shortest["INTRODUCTION"~~__~~"CHAPTER"]][[1]];
Short[intro,5]

INTRODUCTION. When on board H.M.S. 'Beagle,' as naturalist, I was much struck with certain fac... nced that Natural Selection has been the main but not exclusive means of modification. CHAPTER

Note the use of the Shortest command in the string matching expression above. Since there are probably multiple copies of the word “CHAPTER” in the text, we have to tell Mathematica how much of the text we want to match… do we want the portion between “INTRODUCTION” and the first instance of the word, the second, the last? Here are two examples to consider:

StringCases["bananarama","b"~~__~~"a"]

{bananarama}

StringCases["bananarama",Shortest["b"~~__~~"a"]]

{bana}

From a string to a list of words

It will be easier for us to analyze the text if we turn it into a list of words. In order to eliminate punctuation, I am going to get rid of everything that is not a word character. Note that doing things this way turns the abbreviation “H.M.S.” into three separate words.

introList=StringSplit[intro,Except[WordCharacter]..];
Short[introList,4]

{INTRODUCTION,When,on,board,H,M,S,Beagle,as,<>,the,main,but,not,exclusive,means,of,modification,CHAPTER}
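To see the effect of this splitting rule on a small scale, here is a minimal sketch that uses a short made-up phrase instead of the full Introduction. The variable name phrase is only for illustration.

```mathematica
(* A short illustrative phrase; the name "phrase" is not used elsewhere *)
phrase = "When on board H.M.S. 'Beagle,' as naturalist";

(* Split on every run of one or more non-word characters *)
StringSplit[phrase, Except[WordCharacter] ..]
(* {"When", "on", "board", "H", "M", "S", "Beagle", "as", "naturalist"} *)
```

The periods, quotation marks, comma and spaces all count as separators, which is why the abbreviation comes back as three one-letter words.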

Mathematica has a number of commands for selecting elements from lists. The Take command allows us to extract a given number of items from the beginning of a list.

Take[introList,40]

{INTRODUCTION,When,on,board,H,M,S,Beagle,as,naturalist,I,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,South,America,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,
of,that}

The First command returns the first item in a list, and the Rest command returns everything but the first element. The Last command returns the last item.

First[introList]

INTRODUCTION

Short[Rest[introList]]

{When,on,<>,modification,CHAPTER}

Last[introList]

CHAPTER

We can also use an index to pull out list elements.

introList[[8]]

Beagle

We can test whether or not a given item is a member of a list with the MemberQ command.

MemberQ[introList, "naturalist"]

True

MemberQ[introList, "naturist"]

False

Processing each element in a list

If we want to apply some kind of function to every element in a list, the most natural way to accomplish this in Mathematica is with the Map command. Here we show three examples using the first 40 words of the Introduction. Note that Map returns a new list rather than altering the original one.

Map[ToUpperCase, Take[introList, 40]]

{INTRODUCTION,WHEN,ON,BOARD,H,M,S,BEAGLE,AS,NATURALIST,I,WAS,MUCH,STRUCK,
WITH,CERTAIN,FACTS,IN,THE,DISTRIBUTION,OF,THE,INHABITANTS,OF,SOUTH,AMERICA,
AND,IN,THE,GEOLOGICAL,RELATIONS,OF,THE,PRESENT,TO,THE,PAST,INHABITANTS,OF,
THAT}

Map[ToLowerCase, Take[introList, 40]]

{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,south,america,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,of,
that}

Map[StringLength, Take[introList, 40]]

{12,4,2,5,1,1,1,6,2,10,1,3,4,6,4,7,5,2,3,12,2,3,11,2,5,7,3,2,3,10,9,2,3,7,
2,3,4,11,2,4}
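Mathematica also has a shorthand for Map, written /@. The sketch below uses a small made-up list (the name words is only for illustration) to show that both forms give the same result and that the original list is left alone.

```mathematica
(* A small illustrative list *)
words = {"on", "board", "Beagle"};

(* Map written out in full *)
Map[StringLength, words]
(* {2, 5, 6} *)

(* The same operation using the /@ shorthand *)
StringLength /@ words
(* {2, 5, 6} *)

(* The original list is unchanged *)
words
(* {"on", "board", "Beagle"} *)
```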

Computing word frequencies

In order to compute word frequencies, we first convert all words to lowercase, then sort them and count how often each appears using the Tally command. This gives us a list of lists, where each of the smaller lists contains a single word and its frequency.

lowerIntroList=Map[ToLowerCase,introList];
sortedIntroList=Sort[lowerIntroList];
wordFreq=Tally[sortedIntroList];
Short[wordFreq]

{{1837,1},{1844,2},<>,{years,3},{yet,2}}
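A toy example may make the shape of the result easier to see. The list toy below is made up for illustration. Tally reports items in order of first appearance, so sorting beforehand is what puts the tallied words in alphabetical order.

```mathematica
(* A made-up list of lowercase words *)
toy = {"the", "origin", "of", "the", "species"};

(* Without sorting, words appear in order of first occurrence *)
Tally[toy]
(* {{"the", 2}, {"origin", 1}, {"of", 1}, {"species", 1}} *)

(* Sorting first gives an alphabetical tally *)
Tally[Sort[toy]]
(* {{"of", 1}, {"origin", 1}, {"species", 1}, {"the", 2}} *)
```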

Finally we can sort our tally list by the frequency of each item. This is traditionally done in descending order. In Mathematica we can change the sort order by passing the Sort command an anonymous function. (It isn’t crucial for this example to understand exactly how this works, but it is explained in the next section if you are curious. If not, just skip ahead.)

sortedFrequencyList = Sort[wordFreq, #1[[2]] > #2[[2]] &];
Short[sortedFrequencyList, 8]

{{the,100},{of,91},{to,54},{and,52},{i,44},{in,37},{that,27},{a,24},
{this,20},{it,20},{be,20},{which,18},<>,{admiration,1},{admirably,1},
{adduced,1},{adapted,1},{acquire,1},{acknowledging,1},{acknowledged,1},
{accuracy,1},{account,1},{absolutely,1},{1837,1}}

Here are the twenty most frequent words:

Take[sortedFrequencyList, 20]

{{"the", 100}, {"of", 91}, {"to", 54}, {"and", 52}, {"i", 44},
{"in", 37}, {"that", 27}, {"a", 24}, {"this", 20}, {"it", 20},
{"be", 20}, {"which", 18}, {"have", 18}, {"species", 17},
{"on", 17}, {"is", 17}, {"as", 17}, {"my", 13}, {"been", 13},
{"for", 11}}

The Cases statement pulls every item from a list that matches a pattern. Here we are looking to see how often the word “modification” appears.

Cases[wordFreq, {"modification", _}]

{{"modification", 4}}

Aside: Anonymous Functions

Most programming languages let you define new functions, and Mathematica is no exception. You can use these new functions with built-in commands like Map.

plus2[x_] := x + 2

Map[plus2, {1, 2, 3}]

{3, 4, 5}

Being able to define functions allows you to

hide details: as long as you can use a function like plus2 you may not care how it works
reuse and share code: so you don’t have to keep reinventing the wheel.

In Mathematica, you can also create anonymous functions. One way of writing an anonymous function in Mathematica is to use a Slot in place of a variable.

# + 2 &

So we don’t have to define our function in advance; we can just write it where we need it.

Map[# + 2 &, {1, 2, 3}]

{3, 4, 5}

We can apply an anonymous function to an argument like this, and Mathematica will return the number 42.

(# + 2 &)[40]

A named function like plus2 is still sitting there when we’re done with it. An anonymous function disappears immediately after use.
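The sorting examples in this post use an anonymous function with two slots, #1 and #2, which stand for the first and second arguments. Here is a minimal sketch with a made-up list of word and count pairs (the name pairs is only for illustration), showing the slot form next to an equivalent form with named parameters.

```mathematica
(* Made-up {word, count} pairs *)
pairs = {{"a", 1}, {"b", 3}, {"c", 2}};

(* #1 and #2 are the two items being compared; an item goes first
   when its count (its second part) is larger *)
Sort[pairs, #1[[2]] > #2[[2]] &]
(* {{"b", 3}, {"c", 2}, {"a", 1}} *)

(* The same comparison written with named parameters *)
Sort[pairs, Function[{x, y}, x[[2]] > y[[2]]]]
(* {{"b", 3}, {"c", 2}, {"a", 1}} *)
```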

N-grams

The Partition command can be used to create n-grams. The call below asks Mathematica for every sublist of two consecutive elements, moving along the list one element at a time.

bigrams = Partition[lowerIntroList, 2, 1];
Short[bigrams, 8]

{{introduction,when},{when,on},{on,board},{board,h},{h,m},{m,s},{s,beagle},
{beagle,as},{as,naturalist},<>,{been,the},{the,main},{main,but},
{but,not},{not,exclusive},{exclusive,means},{means,of},{of,modification},
{modification,chapter}}
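The role of the second and third arguments is easier to see on a tiny list. In this sketch the list letters is made up for illustration.

```mathematica
(* A small illustrative list *)
letters = {"a", "b", "c", "d", "e"};

(* Length 2, offset 1: overlapping bigrams *)
Partition[letters, 2, 1]
(* {{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}} *)

(* Length 3, offset 1: overlapping trigrams *)
Partition[letters, 3, 1]
(* {{"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}} *)

(* With no offset given, the sublists do not overlap and the
   leftover element is dropped *)
Partition[letters, 2]
(* {{"a", "b"}, {"c", "d"}} *)
```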

We can tally and sort bigrams, too.

sortedBigrams = Sort[Tally[bigrams], #1[[2]] > #2[[2]] &];
Short[sortedBigrams, 8]

{{{of,the},21},{{in,the},13},{{i,have},11},{{to,the},11},{{which,i},7},
{{to,me},7},{{of,species},6},{{i,shall},5},<>,{{beagle,as},1},
{{s,beagle},1},{{m,s},1},{{h,m},1},{{board,h},1},{{on,board},1},
{{when,on},1},{{introduction,when},1}}

Concordance (Keyword in Context)

A concordance shows keywords in the context of surrounding words. We can make one of these quite easily if we start by generating n-grams. Then we use Cases to pull out all of the 5-grams in the Introduction that have “organic” as the middle word (for example), and format the output with the TableForm command.

fivegrams=Partition[lowerIntroList,5,1];
TableForm[Cases[fivegrams,{_,_,"organic",_,_}]]

affinities of    organic beings on
several distinct organic beings by
coadaptations of organic beings to
amongst all      organic beings throughout
succession of    organic beings throughout
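The same idea can be wrapped in a small reusable function. This is only a sketch: kwic is a hypothetical helper name, width is an assumed parameter giving the number of context words on each side, and the word list toyWords is made up.

```mathematica
(* kwic is a hypothetical helper: it keeps every window of
   2*width + 1 consecutive words whose middle word is the keyword *)
kwic[words_List, key_String, width_Integer?Positive] :=
  Select[Partition[words, 2 width + 1, 1], #[[width + 1]] === key &]

(* A made-up list of lowercase words *)
toyWords = {"the", "origin", "of", "species", "by", "means", "of",
   "natural", "selection"};

kwic[toyWords, "of", 1]
(* {{"origin", "of", "species"}, {"means", "of", "natural"}} *)

(* The result can be formatted the same way as above *)
TableForm[kwic[toyWords, "of", 1]]
```

As with the 5-gram version, a keyword that falls within width words of the start or end of the list never lands in the middle of a window, so it is not reported.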

Removing stop words

Mathematica has access to a lot of built-in, curated data. Here we grab a list of English stopwords.

stopWords = WordData[All, "Stopwords"];
Short[stopWords, 4]

{0,1,2,3,4,5,6,7,8,9,a,A,about,above,<<234>>,with,within,without,would,x,X,y,Y,yet,you,your,yours,z,Z}

The Select command allows us to use a function to pull items from a list. We want everything that is not a member of the list of stop words.

Short[lowerIntroList, 8]

{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,<<1676>>,species,furthermore,i,am,convinced,
that,natural,selection,has,been,the,main,but,not,exclusive,means,of,
modification,chapter}

lowerIntroNoStopwords = 
  Select[lowerIntroList, Not[MemberQ[stopWords, #]] &];
Short[lowerIntroNoStopwords, 8]

{introduction,board,beagle,naturalist,struck,certain,facts,distribution,
inhabitants,south,america,geological,relations,present,<<697>>,species,
descendants,species,furthermore,am,convinced,natural,selection,main,
exclusive,means,modification,chapter}
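Here is the same filtering step on a toy list, as a minimal sketch. The names toyStops and toyText are made up for illustration. The comparison with Complement shows why Select is used: Complement also removes the stop words, but it sorts the result and discards repeats, which would spoil later counts and n-grams.

```mathematica
(* Made-up stop list and text *)
toyStops = {"the", "of"};
toyText = {"species", "origin", "of", "species"};

(* Select keeps word order and repeated words *)
Select[toyText, Not[MemberQ[toyStops, #]] &]
(* {"species", "origin", "species"} *)

(* DeleteCases with a pattern built from the stop list does the same *)
DeleteCases[toyText, Alternatives @@ toyStops]
(* {"species", "origin", "species"} *)

(* Complement sorts and removes duplicates, so it is not a substitute *)
Complement[toyText, toyStops]
(* {"origin", "species"} *)
```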

Bigrams containing the most frequent words

Here is a more complicated example built mostly from functions we’ve already seen. We start by finding the ten most frequently occurring words once we have gotten rid of stop words.

freqWordCounts = 
 Take[Sort[
   Tally[Take[
     lowerIntroNoStopwords, {1, -120}]], #1[[2]] > #2[[2]] &], 10]

{{shall,9},{species,9},{facts,9},{chapter,5},{variation,5},
{conditions,5},{beings,5},{organic,5},{conclusions,5},{subject,5}}

We remove a few of the words we are not interested in, then we rewrite the bigrams as a list of graph edges. This will be useful for visualizing the results as a network.

freqWords = 
  Complement[Map[First, freqWordCounts], {"shall", "subject"}];
edgeList = 
  Map[#[[1]] -> #[[2]] &, Partition[lowerIntroNoStopwords, 2, 1]];
Short[edgeList, 4]

{introduction->board,board->beagle,beagle->naturalist,<<717>>,
exclusive->means,means->modification,modification->chapter}

We keep only the bigrams in which at least one of the two words is among our frequent words.

freqBigrams = Union[Select[edgeList, MemberQ[freqWords, #[[1]]] &],
   Select[edgeList, MemberQ[freqWords, #[[2]]] &]];
Short[freqBigrams, 4]

{abstract->variation,affinities->organic,allied->species,<<87>>,varieties->species,varying->conditions,volume->facts}

Finally we can visualize the results as a network. When you are exploring a text this way, you often want to keep tweaking your parameters and see if anything interesting comes up.

Framed[Pane[
  GraphPlot[freqBigrams, 
   Method -> {"SpringElectricalEmbedding", 
     "InferentialDistance" -> .1, "RepulsiveForcePower" -> -4}, 
   VertexLabeling -> True, DirectedEdges -> True, 
   ImageSize -> {1100, 800}], {400, 400}, Scrollbars -> True, 
  ScrollPosition -> {400, 200}]]

Document frequencies

We have been looking at the Introduction to Origin. We can also calculate word frequencies for the whole document. When we list the fifty most common words (not including stop words) we can get a better sense of what the whole book is about.

sampleList = 
  Map[ToLowerCase, StringSplit[sample, Except[WordCharacter] ..]];
docFreq = Sort[Tally[Sort[sampleList]], #1[[2]] > #2[[2]] &];
Take[Select[Take[docFreq, 200], 
  Not[MemberQ[stopWords, First[#]]] &], 50]

{{species,1489},{forms,397},{varieties,396},{selection,383},
{natural,361},{life,298},{plants,297},{different,282},{case,281},
{animals,280},{great,260},{distinct,255},{nature,253},{having,252},
{new,244},{long,243},{period,238},{cases,224},{believe,216},
{structure,214},{conditions,211},{genera,210},{generally,199},
{number,198},{common,194},{far,193},{time,191},{degree,190},
{groups,173},{characters,170},{certain,169},{view,168},{large,168},
{instance,165},{modification,161},{facts,157},{closely,155},
{parts,154},{intermediate,154},{modified,153},{genus,147},
{present,143},{birds,143},{produced,141},{individuals,140},
{inhabitants,139},{parent,138},{world,136},{character,136},
{organic,135}}

TF-IDF: Term frequency-Inverse document frequency

The basic intuition behind tf-idf is as follows…

A word that occurs frequently on every page doesn’t tell you anything special about that page. It is a stop word.
A word that occurs only a few times in the whole document or corpus can be ignored.
A word that occurs a number of times on one page but is relatively rare in the document or corpus overall can give you some idea what the page is about.

Here is one way to calculate tf-idf (there are lots of different versions):

tfidf[termfreq_,docfreq_,numdocs_]:=
     Log[termfreq+1.0] Log[numdocs/docfreq]
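A few worked values may help show how this formula behaves. The inputs below are made up; the definition is repeated so that the sketch runs on its own.

```mathematica
(* Same definition as above, repeated so this block is self-contained *)
tfidf[termfreq_, docfreq_, numdocs_] :=
  Log[termfreq + 1.0] Log[numdocs/docfreq]

(* A term seen 3 times here, with a document frequency of 1 out of 15:
   Log[4.] * Log[15.] is roughly 1.386 * 2.708 *)
tfidf[3, 1, 15.0]
(* approximately 3.754 *)

(* When the document frequency equals the number of documents,
   Log[1.] is zero and so is the score *)
tfidf[3, 15, 15.0]
(* 0. *)

(* When the document frequency exceeds the number of documents,
   the second logarithm is negative and so is the score *)
tfidf[3, 150, 15.0]
(* approximately -3.192 *)
```

The last case matters for the chapter analysis that follows, because there the value passed as docfreq is a word's count in the whole book, which can be much larger than 15. Very common words then receive negative scores and drop to the bottom of the ranking.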

Using document frequencies and TF-IDF we can get a sense of what different parts of a text are about. Here is how we would analyze chapter 9 (there are 15 chapters in all).

ch9 = StringCases[sample, Shortest["CHAPTER 9" ~~ __ ~~ "CHAPTER"]][[
   1]];
ch9List = Map[ToLowerCase, StringSplit[ch9, Except[WordCharacter] ..]];
ch9Terms = Union[ch9List];
ch9TermFreq = Sort[Tally[ch9List], #1[[2]] > #2[[2]] &];
ch9DocFreq = Select[docFreq, MemberQ[ch9Terms, #[[1]]] &];

computeTFIDF[termlist_, tflist_, dflist_] :=
 Module[{outlist, tf, df},
  outlist = {};
  Do[
   tf = Cases[tflist, {t, x_} -> x][[1]];
   df = Cases[dflist, {t, x_} -> x][[1]];
   outlist = Append[outlist, {t, tf, df, tfidf[tf, df, 15.0]}],
   {t, termlist}];
  Return[outlist]]

ch9TFIDF = 
  Sort[computeTFIDF[ch9Terms, ch9TermFreq, 
    ch9DocFreq], #1[[4]] > #2[[4]] &];
Take[ch9TFIDF, 50][[All, 1]]

{teleostean,tapir,richest,pebbles,mississippi,downs,decay,conchologists,
wear,thinner,tear,supplement,superimposed,sedgwick,rolled,poorness,nodules,
mineralogical,levels,inadequate,grinding,gravel,downward,denuded,comprehend,
chthamalus,atom,accumulations,sand,ramsay,littoral,sedimentary,wears,wearing,
wealden,watermark,watch,vehemently,valve,upright,unimproved,unfathomable,
undermined,underlies,unanimously,ubiquitous,transmutation,tides,tidal,swarmed}

Whether or not you are familiar with nineteenth-century science, it should be clear that the chapter has something to do with geology. Darwin also provided chapter summaries of his own:

StringTake[ch9, 548]

CHAPTER 9. ON THE IMPERFECTION OF THE GEOLOGICAL RECORD. On the absence 
of intermediate varieties at the present day. On the nature of extinct 
intermediate varieties; on their number. On the vast lapse of time, as 
inferred from the rate of deposition and of denudation. On the poorness 
of our palaeontological collections. On the intermittence of geological 
formations. On the absence of intermediate varieties in any one formation. 
On the sudden appearance of groups of species. On their sudden appearance 
in the lowest known fossiliferous strata.

Categories Mathematica

A Simple Algorithm for Finding Images in E-books

2012/11/09

In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.

My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurrence function.) As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising I decided to share it so that other people might get some use out of it (or suggest improvements).

In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.
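The comparison described above could be sketched as a small rule-based classifier. Everything specific here is an assumption made for illustration: the function name classifyPage, the three threshold values and the sample file sizes are all made up, and real thresholds would have to be chosen by looking at the actual collection.

```mathematica
(* classifyPage and its three thresholds are hypothetical and would need
   tuning against real page files; all sizes are in bytes *)
classifyPage[jpegBytes_Integer, altoBytes_Integer,
  smallJpeg_: 100000, smallAlto_: 5000, ratioCutoff_: 50] :=
 Which[
  (* both files small: probably a blank page *)
  jpegBytes < smallJpeg && altoBytes < smallAlto, "blank",
  (* large JPEG relative to the ALTO file: probably an illustration *)
  jpegBytes/Max[altoBytes, 1] > ratioCutoff, "image",
  (* otherwise treat the page as mostly text *)
  True, "text"]

(* Made-up sizes for three pages *)
classifyPage[400000, 60000]  (* "text" *)
classifyPage[900000, 4000]   (* "image" *)
classifyPage[40000, 2000]    (* "blank" *)
```

With real files, the two sizes could be obtained with the built-in FileByteCount function applied to the JPEG and ALTO file for each page.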

The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster, pages with images tend to cluster, and blank pages (and covers) fall out along one axis because they have no text at all. We can use more sophisticated image processing and machine learning to further subdivide images and extract them once they are located, but this seems pretty good for a first pass.

Categories Mathematica, Method

Connecting Phidgets to Mathematica on Mac OS X with J/Link

2011/12/28

In my previous post, I showed how to connect an Arduino microcontroller to Mathematica on Mac OS X using the SerialIO package. It is also quite straightforward to interact with Phidgets. In this case we can take advantage of Mathematica’s J/Link Java interface to call the Phidgets API. This is basically a ‘hello world’ demonstration. For a real application you would include error handling, event driven routines, and so on. For more details, read the Java getting started tutorial and Phidgets programming manual, then look at the sample code and javadocs on this page.

Start by installing the Mac OS X Phidgets driver on your system. Once you have run Phidgets.mpkg you can open System Preferences and there will be a pane for Phidgets. For my test, I used a PhidgetInterfaceKit 8/8/8 with an LED on Output 2 and a 60mm slider (potentiometer) attached to Sensor 0. Once you have the hardware configuration you like, plug the InterfaceKit into a USB port. It should show up in the General tab of the Phidgets pane in System Preferences. If you double click on the entry, it will start a demonstration program that allows you to make sure you can toggle the LED and get values back from the slider. When everything is working correctly, you can close the program and open Mathematica.

In a Mathematica notebook, you are going to load the J/Link package, install Java, and put the phidget21.jar file on your class path by editing the AddToClassPath[] command in the snippet below.

Needs["JLink`"]
InstallJava[]
AddToClassPath["/path/to/phidget21"]
phidgetsClass = LoadJavaClass["com.phidgets.InterfaceKitPhidget"]

Next, create a new instance of the InterfaceKit object, open it and wait for attachment. You can include a timeout value if you’d like. Once the InterfaceKit is attached, you can query it for basic information like device name, serial number and sensor and IO options.

ik = JavaNew[phidgetsClass]
ik@openAny[]
ik@waitForAttachment[]
ik@getSerialNumber[]
ik@getDeviceName[]
{ik@getOutputCount[], ik@getInputCount[], ik@getSensorCount[]}

Finally you can use Mathematica’s Dynamic[] functionality to create a virtual slider in the notebook that will waggle back and forth as you move the physical slider attached to the InterfaceKit. You can also turn the LED on and off by clicking a dynamic checkbox in the notebook.

Slider[
 Dynamic[
  Refresh[ik@getSensorValue[0], UpdateInterval -> 0.1]], {0, 1000}]

bool = False;
Dynamic[ik@setOutputState[2, bool]]
Checkbox[Dynamic[bool]]

When you are finished experimenting, close the InterfaceKit object.

ik@close[]

Categories Making, Mathematica

Connecting Arduino to Mathematica on Mac OS X with SerialIO

2011/12/25

I’ve been programming regularly in Mathematica for more than a year, using the language mostly for spidering, text mining and machine learning applications. But now that I am teaching my interactive exhibit design course again, I’ve started thinking about using Mathematica for physical computing and desktop fabrication tasks. First on my to do list was to find a way to send and receive data from the Arduino. A quick web search turned up the work of Keshav Saharia, who is close to releasing a package called ArduinoLink that will make this easy. In the meantime, Keshav helped me to debug a simple demonstration that uses the SerialIO package created by Rob Raguet-Schofield. There were a few hidden gotchas involved in getting this working on Mac OS X, so I thought I would share the process with others who may be interested in doing something similar.

On the Arduino side, I attached a potentiometer to Analog 1, and then wrote a simple program that waits for a signal from the computer, reads the sensor and then sends the value back on the serial port. It is based on the Serial Call and Response tutorial on the Arduino website.

/*
 arduino_mathematica_example

 This code is adapted from
 http://arduino.cc/en/Tutorial/SerialCallResponse

 When started, the Arduino sends an ASCII A on the serial port until
 it receives a signal from the computer. It then reads Analog 1,
 sends a single byte on the serial port and waits for another signal
 from the computer.

 Test it with a potentiometer on A1.
 */

int sensor = 0;
int inByte = 0;

void setup() {
  Serial.begin(9600);
  establishContact();
}

void loop() {
  if (Serial.available() > 0) {
    inByte = Serial.read();
    // divide sensor value by 4 to return a single byte 0-255
    sensor = analogRead(A1)/4;
    delay(15);
    Serial.write(sensor);
  }
}

void establishContact() {
  while (Serial.available() <= 0) {
    Serial.print('A');
    delay(100);
  }
}

Once the sketch is installed on the Arduino, close the Arduino IDE (otherwise the device will look busy when you try to interact with it from Mathematica). On the computer side, you have to install the SerialIO package in

/Users/username/Library/Mathematica/Applications

and make sure that it is in your path. If the following command does not evaluate to True

MemberQ[$Path, "/Users/username/Library/Mathematica/Applications"]

then you need to run this command

AppendTo[$Path, "/Users/username/Library/Mathematica/Applications"]

Next, edit the file

/Users/username/Library/Mathematica/Applications/SerialIO/Kernel/init.m

so the line

$Link = Install["SerialIO"]

reads

$Link =
Install["/Users/username/Library/Mathematica/Applications/SerialIO/MacOSX/SerialIO",
LinkProtocol -> "Pipes"]

If you need to find the port name for your Arduino, you can open a terminal and type

ls /dev/tty.*

The demonstration program is shown below. You can download both the Arduino / Wiring sketch and the Mathematica notebook from my GitHub repository. You need to change the name of the serial device to whatever it is on your own machine.

<<SerialIO`

myArduino = SerialOpen["/dev/tty.usbmodem3a21"]

SerialSetOptions[myArduino, "BaudRate" -> 9600]

SerialReadyQ[myArduino]

Slider[
 Dynamic[Refresh[SerialWrite[myArduino, "B"];
  First[SerialRead[myArduino] // ToCharacterCode],
  UpdateInterval -> 0.1]], {0, 255}]

The Mathematica code loads the SerialIO package, sets the rate of the serial connection to 9600 baud to match the Arduino, and then polls the Arduino ten times per second to get the state of the potentiometer. It doesn’t matter what character we send the Arduino (here we use an ASCII B). We need to use ToCharacterCode[] to convert the response to an integer between 0 and 255. If everything worked correctly, you should see the slider wiggle back and forth in Mathematica as you turn the potentiometer. When you are finished experimenting, you need to close the serial link to the Arduino with

SerialClose[myArduino]
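To make the byte conversion described above concrete, here is a minimal sketch that runs without any hardware attached. The value assigned to byte is made up. Because the Arduino sketch divides the analog reading by 4 before sending it, multiplying by 4 recovers an approximation of the original reading.

```mathematica
(* ToCharacterCode turns a string into a list of integer codes *)
ToCharacterCode["B"]
(* {66} *)

(* First pulls the single integer out of that list *)
First[ToCharacterCode["B"]]
(* 66 *)

(* A made-up byte value standing in for a reading from the Arduino;
   multiplying by 4 approximates the original analog value *)
byte = 200;
4 byte
(* 800 *)
```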

Categories Making, Mathematica

Designing Interactive Exhibits

2011/12/17

For a number of years I’ve taught a studio course in our public history graduate program on designing interactive exhibits. Most academic historians present their work in monographs and journal articles unless they are way out there on the fringe, in which case they may be experimenting with trade publications, documentary film, graphic novels, photography, websites, blogs, games or even more outré genres. Typically the emphasis remains on creating representations that are intended to be read in some sense, ideally very carefully. Public historians, however, need to be able to communicate to larger and more disparate audiences, in a wider variety of venues, and in settings where they may not have all, or even much, of the attention of their publics. Exhibits that are designed merely to be read closely are liable to be mostly ignored. When that happens, of course, it doesn’t matter how interesting your interpretation is.

Students in the course learn how to embed their interpretations in interactive, ambient and tangible forms that can be recreated in many different settings. To give some idea of the potential, consider the difference between writing with a word processor and stepping on the brake of a moving car. While using a word processor you are focused on the task and aware that you are interacting with a computer. The interface is intricate, sensorimotor involvement is mostly limited to looking and typing, and your surrounding environment recedes into the background of awareness. On the other hand, when braking you are focused on your involvement with the environment. Sensorimotor experiences are immersive, the interface to the car is as simple as possible, and you are not aware that you are interacting with computers (although recent-model cars in fact have dozens of continuously operating and networked microcontrollers).

Academic historians have tended to emphasize opportunities for knowledge dissemination that require our audience to be passive, focused and isolated from one another and from their surroundings. When we engage with a broader public, we need to supplement that model by building some of our research findings into communicative devices that are transparently easy to use, provide ambient feedback, and are closely coupled with the surrounding environment. The skills required to do this come from a number of research fields that ultimately depend on electronics and computers. Thanks to the efforts of community-minded makers, hackers, and researchers, these techniques are relatively easy to learn and apply.

Physical computing. In order to make objects or environments aware of people, to make them responsive and interactive, we need to give them a better sense of what human beings are like and what they’re capable of (Igoe & O’Sullivan 2004; Igoe 2011). Suppose your desktop computer had to guess what you look like based on your use of a word processor. It could assume that you have an eye and an ear–because you respond to things presented on the screen and to beeps–and it could assume you have a finger–because you push keys on the keyboard. To dramatize this, I usually use the image above, which is based on a drawing in Igoe and O’Sullivan (2004). It looks horrible: people are nothing like that. By giving our devices a better sense of what we’re actually like, we make it possible for them to better fit into our ongoing lifeworlds.

Pervasive computing. We are at the point where computational devices are becoming ubiquitous, invisible, part of the surroundings (McCullough 2004). The design theorist Adam Greenfield refers to this condition as “everyware” (2006). A number of technologies work together to make this possible. Embedded microprocessors put the power of full computers into tiny packages. Micro-electro-mechanical systems (MEMS) include sensors and actuators to sense and control the environment. Radio transceivers allow these miniature devices to communicate with one another and get online. Passive radio frequency ID circuits (RFIDs) are powered by radio waves to transmit identifying information. All of these systems are mass-produced so that unit costs are very low, and it becomes possible to imagine practically everything being manufactured with its own unique identifier and web address. This scenario is sometimes called the “internet of things.” Someday instead of searching for your keys you may be able to Google for them instead. As Bruce Sterling notes, practically everything in the world could become the “protagonist of a documented process” (2005). Provenance has typically had to be reconstructed painstakingly for a tiny handful of objects. Most historians are not ready to conduct research in a world where every object can tell us about its own history of manufacture, ownership, use, repair, and so on. Dealing with pervasive computation will require the ability to quickly focus on essential information, to relegate non-essential information to peripheral awareness, and to access information in the places and settings where it can make a difference.

Interaction Design. The insinuation of computation and interactivity into every conceivable setting has forced designers to abandon the traditional idea of “human-computer interaction,” and to take a much more expansive perspective instead (Moggridge 2006; Saffer 2006). Not only is everything becoming a potential interface, but many smart devices are better conceptualized as mediating between people, rather than between person and machine. Services like ordering a cup of coffee at Starbucks are now designed using the same techniques as those used to create interactive software (e.g., Google calendar) and hardware (e.g., the iPod). In order to benefit from the lessons of interaction design, historians will have to take into account the wide range of new settings where we can design experiences and shape historical consciousness. The technology of tangible computing provides a link between pervasive devices, social interaction, and the material environment (Dourish 2004).

Desktop Fabrication. Most radical of all, everything that is in digital form can be materialized, via machines that add or subtract matter. The former include a range of 3D printing technologies that deposit tiny amounts of glue, plastic or other materials, or that use lasers to selectively fuse small particles of metal, ceramic or plastic. The latter include computer-controlled milling machines, lathes, drills, grinders, laser cutters and other tools. The cost of these devices has been dropping rapidly, while their ease-of-use increases. The physicist Neil Gershenfeld has assembled a number of “fab labs”—universal fabrication laboratories—from collections of these devices. At present, a complete fab lab costs around $30-$40,000 and a few key machines are considerably cheaper (Gershenfeld 2000, 2007). Enthusiasts talk about the possibility of downloading open source plans and “printing out” a bicycle, an electric guitar, anything really. An open source hardware community is blossoming, aided in part by O’Reilly Media’s popular MAKE magazine and by websites like Instructables and Thingiverse. Desktop fabrication makes it possible to build and share custom interactive devices that communicate our knowledge in novel, material forms.

References

Dourish, Paul. Where the Action Is: The Foundations of Embodied Interaction. Cambridge, MA: MIT, 2004.
Gershenfeld, Neil. When Things Start to Think. New York: Holt, 2000.
Gershenfeld, Neil. Fab: The Coming Revolution on Your Desktop—From Personal Computers to Personal Fabrication. New York: Basic, 2007.
Greenfield, Adam. Everyware: The Dawning Age of Ubiquitous Computing. Berkeley, CA: New Riders, 2006.
Igoe, Tom. Making Things Talk, 2nd ed. Sebastopol, CA: O’Reilly, 2011.
Igoe, Tom and Dan O’Sullivan. Physical Computing: Sensing and Controlling the Physical World with Computers. Thomson Course Technology, 2004.
McCullough, Malcolm. Digital Ground: Architecture, Pervasive Computing and Environmental Knowing. Cambridge, MA: MIT, 2004.
Moggridge, Bill. Designing Interactions. Cambridge, MA: MIT, 2006.
Norretranders, Tor. The User Illusion: Cutting Consciousness Down to Size. New York: Penguin, 1999.
Saffer, Dan. Designing for Interaction: Creating Smart Applications and Clever Devices. Berkeley, CA: New Riders, 2006.
Sterling, Bruce. Shaping Things. Cambridge, MA: MIT, 2005.
Torrone, Phillip. “Open Source Hardware, What Is It? Here’s a Start…” MAKE: Blog (23 Apr 2007).

Categories Making

What’s Next for the Programming Historian

2011/11/28 //

When Alan MacEachern and I started working on the first edition of The Programming Historian in late 2007, our goal was to create an online resource that could be used by historians and other humanists to teach themselves a little bit of programming. Many introductory texts and websites approach programming languages in a systematic (if dull) way, starting with basics such as data types and gradually introducing various language constructs. This is fine if you already know how to program. Most beginners, however, are more concerned with addressing a practical need than they are with learning technical details that don’t seem to be immediately relevant. We wanted to approach programming as a means of expression. Plenty of time to begin learning grammar after you’ve had a few conversations, as it were.

Neither of us expected PH to become nearly as popular as it did. We’re still young (OK, technically we’re middle aged) but we’ll have to work pretty hard to ever gain as many readers for anything else we write. While gratifying, the success of the first edition raised new problems. Some people wanted to pitch in. Some wanted help with particular problems. Some wanted to translate the material into other natural languages or other programming languages. In the meantime, websites changed, operating systems changed, software libraries changed, programming languages changed. Change is good! But dealing with change is difficult if one or two people try to do everything by themselves. Fortunately, there is a better way.

For a couple of years now, Adam Crymble has been working with us on creating a new edition of The Programming Historian that will be open to user contributions. There will be a number of ways for people to get involved: as writers, programmers, editors, technical reviewers, testers, website hackers, graphic designers, discussants, translators, and so on. All contributions will be peer-reviewed, and everyone who participates will get credit for his or her work. One of our aims with the first edition was to maintain a narrative thread that led the user through a series of useful projects. The new edition is organized in terms of short lessons that build on knowledge acquired in previous ones. Informally, you might think of this as “choose your own adventure.” Technically the new site will be structured like a directed acyclic graph, with tools that make it easy to keep track of what you’ve learned so far and provide you with a number of choices going forward. All source code will be under version control, making it easy to maintain and fork.
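To make the directed acyclic graph idea a little more concrete, here is a minimal sketch in Mathematica. The lesson names and the links between them are hypothetical, as is the helper function prerequisites; nothing here describes how the new site is actually implemented.

(* Hypothetical lessons; a -> b means that lesson a should be completed before lesson b *)
lessons = Graph[{
    "Getting Started" -> "Working with Files",
    "Getting Started" -> "Downloading Web Pages",
    "Working with Files" -> "Counting Frequencies",
    "Downloading Web Pages" -> "Counting Frequencies"},
    VertexLabels -> "Name"];
AcyclicGraphQ[lessons]
(* True *)

(* Lessons that are not yet done and whose prerequisites have all been completed *)
completed = {"Getting Started", "Working with Files"};
prerequisites[g_, v_] := Cases[EdgeList[g], DirectedEdge[u_, v] :> u]
Select[
    Complement[VertexList[lessons], completed],
    Complement[prerequisites[lessons, #], completed] === {} &]
(* {"Downloading Web Pages"} *)

Because the graph has no cycles, there is always at least one lesson with no unmet prerequisites until every lesson has been completed, which is what makes the “choose your own adventure” structure workable.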

Over the next few months we will be inviting beta contributors to help us design and develop the website, write and program new lessons, do editing and peer reviewing, and generally turn the goodness up to 11. There will be new lessons that lead into subjects like visualization, geospatial data, image search, integration with external tools, and the use of APIs. We will also be working with new institutional partners and exploring the connections to be made with other, similar projects. If you would like to get involved, please don’t hesitate to e-mail me and let me know. We will do a public launch when everything is working smoothly and we are ready to accept general contributions, hopefully sometime in 2012.

Categories Uncategorized

Social Network Analysis and Visualization

2011/08/02 //

In April 2008, I posted an article in my blog Digital History Hacks about visualizing the social network of NiCHE: Network in Canadian History & Environment as it was forming. We now use custom programs written in Mathematica to explore and visualize the activities of NiCHE members, and to assess our online communication strategies. Some of the data comes from our online directory, where members can contribute information about their research interests and activities. Some of it comes from our website server logs, and some of it is scraped from social networking sites like Twitter. A handful of examples are presented here, but the possibilities for this kind of analysis are nearly unbounded.

Clusters of Research Interest

When NiCHE members add their information to our online directory, they are encouraged to select one or more of a set of research interests. This heat map shows the degree of overlap for each pair of topics, with brighter colours indicating a greater number of people who indicated an interest in both areas. Some of the pairings are not surprising: people who are interested in landscape are often interested in conservation, environmentalism and parks, and vice versa. The absence of overlap is also meaningful. People who are interested in fisheries seem not to be interested in landscape, and vice versa. Why not? A workshop that tried to bring both groups together to search for common ground might lead to new insights. Studying visualizations like this one also allows us to assess the extent to which our original thematic projects (focusing on Landscapes, Forest History, Water, etc.) actually cover the interests of members. Some of the members of the NiCHE executive do research in the history and philosophy of science, and this is apparently something that many NiCHE members are also interested in. A future workshop to address this interest might be co-hosted by NiCHE and the Situating Science knowledge cluster.
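A heat map of this kind can be built from a table of pairwise counts. The following is a minimal sketch rather than the code used for the figure: the member names and interests are invented, and the variable names are introduced here for illustration.

(* Hypothetical directory entries: member -> selected research interests *)
interests = {
    "Member A" -> {"Landscape", "Conservation", "Parks"},
    "Member B" -> {"Landscape", "Conservation"},
    "Member C" -> {"Fisheries", "Water"},
    "Member D" -> {"Water", "Conservation"}};
topics = Union @@ (Last /@ interests);

(* overlap[[i, j]] is the number of members who selected both topic i and topic j *)
overlap = Table[
    Length[Select[Last /@ interests, MemberQ[#, s] && MemberQ[#, t] &]],
    {s, topics}, {t, topics}];
TableForm[overlap, TableHeadings -> {topics, topics}]
MatrixPlot[overlap]

The diagonal of the matrix simply counts the members interested in each topic. A zero off the diagonal marks a pair of topics that no one has selected together, like fisheries and landscape in the real data.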

Research Interest Overlaps

Looking at the same information in a different way brings new things to light. This figure shows the degree of overlap between pairs of research interests as a graph rather than a heat map. Research topics are represented as vertices, and the size of the edge connecting each pair indicates the degree of overlap. This graph suggests that NiCHE members who are interested in subjects that focus on material evidence over very long temporal durations are relatively marginal in the knowledge cluster, and may not be well connected even with one another. Again, being able to visualize the data gives us the possibility of addressing the situation. Perhaps we should do more outreach to geologists and archaeologists?
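One way to draw such a graph is to give each edge a thickness that depends on the overlap count. This sketch uses made-up counts for three pairs of topics, and interprets the “size” of an edge as its thickness, which is an assumption.

(* Hypothetical overlap counts for three pairs of topics *)
pairCounts = {
    {"Landscape", "Conservation"} -> 6,
    {"Landscape", "Parks"} -> 3,
    {"Fisheries", "Water"} -> 1};
edges = UndirectedEdge @@@ (First /@ pairCounts);

(* Each edge is drawn with a thickness, in points, equal to its overlap count *)
Graph[edges,
    EdgeStyle -> Thread[edges -> (AbsoluteThickness /@ (Last /@ pairCounts))],
    VertexLabels -> "Name"]

Topics that share no members end up in separate components of the graph, which makes marginal or disconnected interests easy to spot.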

Bridging Capital

Studies of social networks suggest that their “small world” properties are typically due to people who provide bridges between interest groups or make other kinds of long-distance connections. Here we use a graph to visualize every pair of research topics that are of interest to a single NiCHE member. From this figure it is easy to see that Darin Kinsey is the only person who has claimed to be interested in both landscapes and fisheries. If we did decide to hold a workshop on the intersection of those two topics, he might be the ideal person to help organize it. If we want to try to get scholars talking to one another across methodological or thematic boundaries, then we should enlist the help of people like Ravi Ranganathan, Norm Catto and Liza Piper to get the conversation started.
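Finding these bridging members amounts to listing the pairs of topics that exactly one person has chosen. Here is a minimal sketch with invented directory entries; memberPairs is a name introduced for illustration.

(* Hypothetical directory entries: member -> selected research interests *)
interests = {
    "Member A" -> {"Landscape", "Fisheries"},
    "Member B" -> {"Landscape", "Conservation"},
    "Member C" -> {"Conservation", "Landscape", "Parks"}};

(* Every pair of topics chosen by each member, tagged with that member's name *)
memberPairs = Flatten[
    Table[{Sort[pair], First[m]},
        {m, interests}, {pair, Subsets[Last[m], {2}]}], 1];

(* Keep the topic pairs that only one member has chosen *)
Select[GatherBy[memberPairs, First], Length[#] == 1 &][[All, 1]]
(* {{{"Fisheries", "Landscape"}, "Member A"}, {{"Conservation", "Parks"}, "Member C"}, {{"Landscape", "Parks"}, "Member C"}} *)

The pair Conservation and Landscape is dropped because two members share it. The pairs that remain identify the people who are the sole link between two topics.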

Centrality of NiCHE Activities

Our online directory also allows NiCHE members to indicate which activities they have participated in. The vertices in this graph represent NiCHE members and activities that we have sponsored. If a member participated in a particular activity, there is an edge connecting the two vertices. The color of each vertex represents a measure of network centrality. Here we have labeled the most central of our activities. Note that the conferences (especially Confluences 2007 and Climate History 2008) drew relatively large numbers of participants who did not attend other NiCHE activities. Participants in the summer field schools (CHESS), on the other hand, were much more likely to attend more than one of our activities. This suggests that our field schools do a better job of helping to constitute NiCHE as an ongoing entity than regular conferences would. This is consistent with reports that we have received from regular CHESS participants, especially new scholars.
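A graph like this can be sketched in a few lines. The participation records below are invented, and the choice of BetweennessCentrality is an assumption, since the measure used for the figure is not named here; other measures such as DegreeCentrality could be substituted.

(* Hypothetical participation records: {member, activity} *)
participation = {
    {"Member A", "Field School 1"}, {"Member A", "Field School 2"},
    {"Member B", "Field School 1"}, {"Member B", "Field School 2"},
    {"Member B", "Conference"}, {"Member C", "Conference"},
    {"Member D", "Conference"}};
g = Graph[UndirectedEdge @@@ participation];

(* One centrality value per vertex, in the same order as VertexList[g] *)
centrality = BetweennessCentrality[g];
Graph[g,
    VertexStyle -> Thread[VertexList[g] ->
        (ColorData["TemperatureMap"] /@ Rescale[centrality])],
    VertexLabels -> "Name"]

In this toy example, Members C and D attend only the conference, so they sit at the edge of the network, while the members who attend several activities tie the graph together.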

NiCHE Twitter Followers

In April 2011, NiCHE had about 340 followers on the social networking site Twitter. Each vertex in this graph represents one Twitter user. The size of the icon is scaled according to the log of the number of followers that each user has. (In this case, the number of followers ranges from a handful to tens of thousands, depending on the user.) The edges of the graph represent some of the connections between Twitter users who follow one another. This figure shows that the NiCHE Twitter audience includes a relatively dense network of scholars who identify themselves either as digital humanists or as Canadian / environmental historians or geographers. There is also a relatively large collection of followers who do not appear to have many connections with one another. Knowing something about who is reading our tweets enables us to gauge the degree to which our online knowledge mobilization activities are effective, and helps us think about targeting our messages to particular communities.
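Scaling vertices by the log of a count is straightforward to sketch. The account names, follower counts and connections below are invented, and the size range of 0.1 to 0.5 is an arbitrary choice.

(* Hypothetical accounts and follower counts *)
followerCounts = {"userA" -> 25, "userB" -> 480, "userC" -> 12000};
users = First /@ followerCounts;
mutualFollows = {UndirectedEdge["userA", "userB"], UndirectedEdge["userB", "userC"]};

(* Vertex size grows with the log of the follower count, rescaled to the range 0.1 to 0.5 *)
sizes = 0.1 + 0.4*Rescale[N[Log[Last /@ followerCounts]]];
Graph[users, mutualFollows,
    VertexSize -> Thread[users -> sizes],
    VertexLabels -> "Name"]

Using the log keeps an account with tens of thousands of followers from dwarfing everyone else in the figure.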

Categories Mathematica

Simple Acoustic Data Acquisition with Mathematica

2011/06/28 //

For physical computing or amateur science projects, you often need to be able to get the output of a sensor or transducer into your computer. There are a lot of ways to do this with specialized data acquisition hardware and software. This method is pretty old school, and requires only a standard laptop and a handful of inexpensive electronic components. It may be appropriate if you are working in a relatively quiet environment and you don’t need to take a lot of samples per second.

Just about any physical signal or measurement in the world can be converted into a fluctuating voltage, an analog signal. Most laptops have a built-in microphone, so if you can convert your voltage into an audible signal, you can use the microphone to digitize it. Once it is in digital form, you can then process the signal with any programming language. Here we’ll use Mathematica.

The electronic side of the project is a very simple metronome (or tone generator) based on the 555 timer. It is a mashup of two circuits from Forrest Mims’ work, the V/F Converter on page 51 of his Science and Communication Circuits and Projects and the Audio Oscillator / Metronome on p. 22 of his Timer, Op Amp and Optoelectronic Circuits and Projects.

The changing voltage from a sensor or transducer is fed into pin 5 on the timer. For voltages between about 1.25v and 4.70v, the timer will respond with chirps or clicks whose frequency depends on the voltage. With lower voltages, the frequency is higher, and with higher voltages it is lower. The variable resistor can be used to tune the center of the frequency range. Increasing the value of the capacitor decreases the frequency and makes each individual chirp longer. Decreasing the value of the capacitor increases the frequency and makes each chirp shorter. For the test here, the variable resistor was set to 100K measured with a digital multimeter. The piezo element was rated for 3-20v, 10mA. You can make the circuit portable by powering it with a 9v battery, but you might have to adjust the values of the resistors and capacitor for best results. I used a bench power supply and proto-board for my experiments.

To make a sample signal, I connected the signal line to a power supply set at 4.7v, decreased it by hand to 1.25v, then increased it back to 4.7v. The circuit responded by producing slower, faster and slower clicks. If you would like to listen to the recording, there are MP3 and WAV files in the GitHub repository.

My original plan was to use the SystemDialogInput["RecordSound"] command to record the audio directly into Mathematica, but that feature is unfortunately not supported on Macs yet. So I recorded the sample using Audacity instead, and then used the Import command to load the sound file into Mathematica.

My complete Mathematica notebook is available for download on GitHub. Here are the highlights.

We load the signal in as data, and plot every thousandth point to get an idea of what it looks like. Since we sampled the sound at 44100 Hz, there will be 44100 data points for each second of audio, or 44.1 data points per millisecond. The duration of each click is on the order of 10 milliseconds or less, so we won’t be able to make out individual clicks this way.

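(* path is assumed to be a string, defined earlier in the notebook, naming the folder that holds the WAV file *)
signalData = Flatten@Import[path <> "sample-data.wav", "Data"];
(* plot every thousandth sample for an overview of the whole recording *)
ListLinePlot[
   signalData[[1 ;; Length[signalData] ;; 1000]],
   ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
   Joined -> False, Filling -> Axis]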

Near the beginning of the sample, the clicks are relatively sparse. If we start at the 3 second mark and plot every point for 50 milliseconds of our data, we can see one click.

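second = 44100;          (* samples per second *)
msec = N[second/1000];   (* samples per millisecond: 44.1 *)
(* 50 milliseconds of samples, starting at the 3 second mark *)
sparseRegion = signalData[[(3*second) ;; (3*second) + Round[50*msec]]];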
ListLinePlot[sparseRegion,
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]

We are interested in pulling out the clicks, so we take the absolute value of our data, then use the MovingAverage command to smooth it a bit. Averaging over ten data points seems to give us a sharp peak.

ListLinePlot[MovingAverage[Abs[sparseRegion], 10],
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]
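To see what these two steps do, it may help to try them on a tiny list rather than on the recording. The numbers below are made up for illustration and have nothing to do with the real signal.

toySignal = {0, -4, 4, -4, 0, 0};
Abs[toySignal]
(* {0, 4, 4, 4, 0, 0} *)
MovingAverage[Abs[toySignal], 2]
(* {2, 4, 4, 2, 0} *)

Taking the absolute value folds the negative half of the oscillation onto the positive half, and the moving average then merges the burst into a single hump. Averaging over r values shortens the list by r - 1 elements.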

Now we can use a thresholding function that returns 1 when the signal is above a given level and 0 otherwise. We use Map to apply the function to each data point.

thresholdFunction[value_, threshold_] :=
    If[value > threshold, 1, 0]
threshold = 0.25;
ListLinePlot[
    Map[
        thresholdFunction[#, threshold] &, 
        MovingAverage[Abs[sparseRegion], 10]], 
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All, 
    Joined -> False, Filling -> Axis]

This looks pretty good when the clicks are infrequent. We need to make sure that it will work in a region of our signal where clicks are more frequent, too. We look at a 50-millisecond-long region 10 seconds into our signal, doing the same processing and plotting the results.

denseRegion=signalData[[(10*second) ;; (10*second) + Round[50*msec]]];
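The plot for the dense region is not reproduced here. A minimal sketch of the same processing, assuming that denseRegion, threshold and thresholdFunction are defined as above, might look like this:

ListLinePlot[
    Map[
        thresholdFunction[#, threshold] &,
        MovingAverage[Abs[denseRegion], 10]],
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]

If the clicks in this region still show up as separate blocks of 1s, the same threshold can be used for the whole recording.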

Since our thresholding looks like it will work nicely for our data, we can apply it to the whole dataset.

signalThresholded = 
    Map[thresholdFunction[#, threshold] &, 
        MovingAverage[Abs[signalData], 10]];

Now we need to count the number of clicks that occur in a given interval of time. We will do this by using the Partition command to partition our data into short, overlapping windows. Each window will be 200 milliseconds long, and will overlap those on its left and right by 100 milliseconds each. To count clicks, we will use the Split command to split each window into runs of identical values (i.e., all 1s or all 0s), then Map the Total function over the resulting lists. The total of a run of 0s is 0, and the total of a run of 1s is its length. If there are more than, say, five 1s in a row, surrounded on either side by a run of 0s, we can assume that a click has occurred.

signalThresholdedPartition = 
    Partition[signalThresholded, Round[200*msec], Round[100*msec]];
countClicks[window_] :=
    Total[
        Map[If[# > 5, 1, 0] &, 
            Map[Total, Split[window]]]]
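Partition and Split are easier to follow on small lists. In this sketch the lists are made up, and the windows are far shorter than the 200 millisecond windows used for the real signal.

(* windows of length 4, each starting 2 elements after the previous one *)
Partition[Range[10], 4, 2]
(* {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}} *)

toyWindow = {0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0};
Split[toyWindow]
(* {{0, 0}, {1, 1, 1, 1, 1, 1}, {0}, {1}, {0}} *)
Map[Total, Split[toyWindow]]
(* {0, 6, 0, 1, 0} *)
Total[Map[If[# > 5, 1, 0] &, Map[Total, Split[toyWindow]]]]
(* 1 *)

The run of six 1s counts as a click. The isolated 1 does not, because it is shorter than the cutoff, so a brief blip of noise that crosses the threshold is ignored.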

We can now plot the number of clicks per window. The click rate starts low, rises, then falls again. Since lower voltages produce more frequent clicks, this shows us that the voltage starts high, drops, then increases again. Note the spike in the middle, which represents a dense cluster of clicks. Looking at the raw signal, this appears to have been a burst of noise. At this point, we have a workable way to convert a varying voltage into a digital signal that we can analyze with Mathematica. For most applications, the next step would be to calibrate the system by recording audio samples for specific voltages and determining how many clicks per window were associated with each.

ListLinePlot[Map[countClicks, signalThresholdedPartition]]
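Because each window is 200 milliseconds long and starts 100 milliseconds after the previous one, the counts can also be put on a time axis and expressed as clicks per second. This is a sketch that assumes countClicks and signalThresholdedPartition are defined as above; clickRate and windowStart are names introduced here.

(* a 200 millisecond window is one fifth of a second *)
clickRate = 5*Map[countClicks, signalThresholdedPartition];
(* start time of each window, in seconds *)
windowStart = 0.1*Range[0, Length[clickRate] - 1];
ListLinePlot[Transpose[{windowStart, clickRate}],
    AxesLabel -> {"seconds", "clicks per second"}]

Since adjacent windows share half of their samples, neighbouring values in this plot are not independent measurements.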

Categories Mathematica

Being Byproductive

2011/05/15 //

In Sex, Drugs and Cocoa Puffs, Chuck Klosterman has an essay that describes his experience playing The Sims:

My SimChuck has absolutely no grit. He is constantly bummed out, forever holding his head and whining about how he’s ‘not comfortable’ or ‘not having fun.’ At one point I bought him a pretty respectable wall mirror for $300, and he responded by saying ‘I’m too depressed to even look at myself.’ As an alternative, he sat on the couch and stared at the bathroom door. … I hope I never own a bed. But don’t tell that to SimChuck. Until I bought him his $1,000 Napoleon Sleigh Bed (‘made with actual wood and real aromatic cedar’), all he did was cry like a little bitch.

Now you might be thinking to yourself that this doesn’t sound very productive, but surely, if there’s one thing that we know about computer games, it’s that they’re supposed to teach us the skills that we need to get by in our real lives. So instead of trying to instill some grit in a Sim, what happens if you apply the same thinking to your own routine?

Take any simple thing that you already know how to do–eat, brush your teeth, dress yourself, answer the phone, write a book (any of the things that you’d think shouldn’t faze your Sim)–take one of those things and measure it along any dimension(s) that you care about. Then break it into a bunch of tiny moving parts, and try to improve them one at a time. When you’ve put the process back together, measure it again to see if your changes made a difference. Revert any change that didn’t improve things.

Programmers call this “refactoring,” but really it isn’t a new idea. In Walden, Thoreau wrote that “To affect the quality of the day, that is the highest of arts.” Academics should already know that process is more important than product, as the idea shows up in various guises throughout the canon, but it’s surprising how few of them act like they know it. Really, your CV shares the same relationship to a life of research and teaching as a coprolite does to a good meal.

When I return from my sabbatical, the second thing that people are going to ask me is “was it productive?” I think I’m going to say, “Well… it was byproductive.” And then we’ll see where the conversation goes from there.

Categories Kaizen

What is the New Manufactory?

2011/05/03 //

From 2005 to 2008 I kept a research weblog called Digital History Hacks. As an open access, open content publishing platform, the blog served my purposes nicely. It was less well suited to sharing open source code, however, and didn’t support humanistic fabrication at all. Adding a private wiki helped a little bit, but didn’t go nearly far enough. So I decided to build The New Manufactory around the following ideas.

Working in the Heraclitean mode. If it ever made sense to divide scholarship into phases of research and writing, it no longer does. We now have to work in a mode where things around us are constantly changing, and we’re trying to do everything, all the time. As Heraclitus supposedly said, “all is flux.” So until your interpretation stabilizes…

You keep refining your ensemble of questions
Your spiders and feeds provide a constant stream of potential sources
Unsupervised learning methods reveal clusters which help to direct your attention
Adaptive filters track your interests as they fluctuate
You create or contribute to open source software as needed
You write/publish incrementally in an open access venue
Your research process is subject to continual peer review
Your reputation develops

Assemblages and rhizomes. Digital scholarship adds algorithms, source code, digital representations, version control, networked collaborators, application programming interfaces, simulations, machine learners, visualizations and a slew of other new things to philology. Beyond that, fabrication adds tools, instruments, materials, machines, workbenches, techniques, feedstock, fasteners, electronics, numerical control, and so on. The things that we have to figure out don’t come in neat packages or fit into hierarchies.

A tight loop between digitization and materialization. Digital representations have a number of well known qualities: they’re perfectly plastic, can be duplicated almost without cost, transmitted in the blink of an eye and stored in vanishingly small physical spaces. Every digital source can also be the subject of computational processing. So it makes a lot of sense to create and share digital records. Freeing information in this social sense, however, doesn’t mean that information can or should always be divorced from material objects or particular settings. We can link the digital and material with GPS, RFIDs, radio triangulation, barcodes, computer vision or embedded network servers. We can augment the material world with digital sources, and increasingly we can materialize digital sources with 3D printers or inexpensive CNC mills and lathes.

Everything should be self-documenting. Recording devices like sensors, scanners and cameras can be built into the workbench. They can automatically upload and archive photographs, videos, audio files and other kinds of digital representations. Electronic instruments can be polled for measurements. Machine tools can report their status via syndicated feeds. When we make and use things in the world, we can make and make use of born-digital data, too.

Co-presence and telepresence. Human beings (and other primates) learn by watching one another’s eyes and hands, so why not make it possible to do this remotely? Workers at a pair of augmented workbenches could be made aware of one another’s actions by a stream of low-latency signals. Sensors and actuators could support remote gesture, touch, manipulation and a sense of presence. Services like Pachube can be used to leverage this information, so that it can be remixed and repurposed.

Categories Making
