Wolfram Screencast & Video Gallery

Scholars in history and other humanities tend to work differently from their colleagues in science and engineering. Working alone or occasionally in small groups, their focus is typically on the close reading of sources, including text and images. Mathematica’s high-level commands and holistic approach make it possible for a single programmer to implement relatively complex research tools that are customized for a particular study. Here William J Turkel describes a few examples, including the mining and conceptual analysis of text, and image mining applied to electronic schematics, photographs of bridges and the visual culture of stage magic. He also discusses the opportunities and challenges of teaching these kinds of skills to university students with little or no technical background.

(If you have Mathematica you can download this as a notebook from my GitHub account. It is also available as a CDF document which can be read with Wolfram’s free CDF Player.)

Introduction

For a couple of years now I have been using Mathematica as my programming language of choice for my digital history work. For one thing, I love working with notebooks, which allow me to mix prose, citations, live data, executable code, manipulable simulations and other elements in a single document. I also love the generality of Mathematica. For any kind of technical work, there is usually a well-developed body of theory that is expressed in objects drawn from some branch of mathematics. Chances are, Mathematica already has a large number of high-level functions for working with those mathematical objects. The Mathematica documentation is excellent, if necessarily sprawling, since there are literally thousands of commands. The challenge is usually to find the commands that you need to solve a given problem. Since few Mathematica programmers seem to be working historians or humanists dealing with textual sources, it can be difficult to figure out where to begin.

Using a built-in text

As a sample text, we will use Darwin’s Origin of Species from Mathematica’s built-in example database. The Short command shows a small piece of something large. Here we’re asking for about two lines of output, showing the beginning and end of this text.

sample = ExampleData[{"Text", "OriginOfSpecies"}];
Short[sample, 2]

Mathematica responds with

INTRODUCTION. When on board H.M.S. ... have been, and are being, evolved.

The Head command tells us what something is. Our text is currently a string, an ordered sequence of characters.

Head[sample]
String

Extracting part of a string

Suppose we want to work with part of the text. We can extract the Introduction of Origin by pulling out everything between “INTRODUCTION” and “CHAPTER 1”. The command that we use to extract part of a string is called StringCases. Once we have extracted the Introduction, we want to check that the command worked the way we expected. Rather than look at the whole text right now, we can use the Short command to show us about five lines of it. It returns a couple of phrases at the beginning and end, using ellipses to indicate the much larger portion which we are not seeing.

intro=StringCases[sample,Shortest["INTRODUCTION"~~__~~"CHAPTER"]][[1]];
Short[intro,5]
INTRODUCTION. When on board H.M.S. 'Beagle,' as naturalist, I was much struck with certain fac... nced that Natural Selection has been the main but not exclusive means of modification. CHAPTER

Note the use of the Shortest command in the string matching expression above. Since there are probably multiple copies of the word “CHAPTER” in the text, we have to tell Mathematica how much of the text we want to match: do we want the portion between “INTRODUCTION” and the first instance of the word, the second, or the last? Here are two examples to consider:

StringCases["bananarama","b"~~__~~"a"]
{bananarama}
StringCases["bananarama",Shortest["b"~~__~~"a"]]
{bana}

From a string to a list of words

It will be easier for us to analyze the text if we turn it into a list of words. In order to eliminate punctuation, I am going to get rid of everything that is not a word character. Note that doing things this way turns the abbreviation “H.M.S.” into three separate words.

introList=StringSplit[intro,Except[WordCharacter]..];
Short[introList,4]
{INTRODUCTION,When,on,board,H,M,S,Beagle,as,<>,the,main,but,not,exclusive,means,of,modification,CHAPTER}

Mathematica has a number of commands for selecting elements from lists. The Take command allows us to extract a given number of items from the beginning of a list.

Take[introList,40]
{INTRODUCTION,When,on,board,H,M,S,Beagle,as,naturalist,I,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,South,America,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,
of,that}

The First command returns the first item in a list, and the Rest command returns everything but the first element. The Last command returns the last item.

First[introList]
INTRODUCTION
Short[Rest[introList]]
{When,on,<>,modification,CHAPTER}
Last[introList]
CHAPTER

We can also use an index to pull out list elements.

introList[[8]]
Beagle

We can test whether or not a given item is a member of a list with the MemberQ command.

MemberQ[introList, "naturalist"]
True
MemberQ[introList, "naturist"]
False

Processing each element in a list

If we want to apply some kind of function to every element in a list, the most natural way to accomplish this in Mathematica is with the Map command. Here we show three examples using the first 40 words of the Introduction. Note that Map returns a new list rather than altering the original one.

Map[ToUpperCase, Take[introList, 40]]
{INTRODUCTION,WHEN,ON,BOARD,H,M,S,BEAGLE,AS,NATURALIST,I,WAS,MUCH,STRUCK,
WITH,CERTAIN,FACTS,IN,THE,DISTRIBUTION,OF,THE,INHABITANTS,OF,SOUTH,AMERICA,
AND,IN,THE,GEOLOGICAL,RELATIONS,OF,THE,PRESENT,TO,THE,PAST,INHABITANTS,OF,
THAT}
Map[ToLowerCase, Take[introList, 40]]
{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,distribution,of,the,inhabitants,of,south,america,
and,in,the,geological,relations,of,the,present,to,the,past,inhabitants,of,
that}
Map[StringLength, Take[introList, 40]]
{12,4,2,5,1,1,1,6,2,10,1,3,4,6,4,7,5,2,3,12,2,3,11,2,5,7,3,2,3,10,9,2,3,7,
2,3,4,11,2,4}

Computing word frequencies

In order to compute word frequencies, we first convert all words to lowercase, then sort them and count how often each appears using the Tally command. This gives us a list of lists, where each of the smaller lists contains a single word and its frequency.

lowerIntroList=Map[ToLowerCase,introList];
sortedIntroList=Sort[lowerIntroList];
wordFreq=Tally[sortedIntroList];
Short[wordFreq]
{{1837,1},{1844,2},<>,{years,3},{yet,2}}

Finally we can sort our tally list by the frequency of each item. This is traditionally done in descending order. In Mathematica we can change the sort order by passing the Sort command an anonymous function. (It isn’t crucial for this example to understand exactly how this works, but it is explained in the next section if you are curious. If not, just skip ahead.)

sortedFrequencyList = Sort[wordFreq, #1[[2]] > #2[[2]] &];
Short[sortedFrequencyList, 8]
{{the,100},{of,91},{to,54},{and,52},{i,44},{in,37},{that,27},{a,24},
{this,20},{it,20},{be,20},{which,18},<>,{admiration,1},{admirably,1},
{adduced,1},{adapted,1},{acquire,1},{acknowledging,1},{acknowledged,1},
{accuracy,1},{account,1},{absolutely,1},{1837,1}}

Here are the twenty most frequent words:

Take[sortedFrequencyList, 20]
{{"the", 100}, {"of", 91}, {"to", 54}, {"and", 52}, {"i", 44},
{"in", 37}, {"that", 27}, {"a", 24}, {"this", 20}, {"it", 20},
{"be", 20}, {"which", 18}, {"have", 18}, {"species", 17},
{"on", 17}, {"is", 17}, {"as", 17}, {"my", 13}, {"been", 13},
{"for", 11}}

The Cases statement pulls every item from a list that matches a pattern. Here we are looking to see how often the word “modification” appears.

Cases[wordFreq, {"modification", _}]
{{"modification", 4}}

Aside: Anonymous Functions

Most programming languages let you define new functions, and Mathematica is no exception. You can use these new functions with built-in commands like Map.

plus2[x_]:=
Return[x+2]

Map[plus2, {1, 2, 3}]
{3, 4, 5}

Being able to define functions allows you to

  • hide details: as long as you can use a function like plus2 you may not care how it works
  • reuse and share code: so you don’t have to keep reinventing the wheel.

In Mathematica, you can also create anonymous functions. One way of writing an anonymous function in Mathematica is to use a Slot in place of a variable.

# + 2 &

This way we don’t have to define our function in advance; we can just write it where we need it.

Map[# + 2 &, {1, 2, 3}]
{3, 4, 5}

We can apply an anonymous function to an argument like this, and Mathematica will return the number 42.

(# + 2 &)[40]

A named function like plus2 is still sitting there when we’re done with it. An anonymous function disappears immediately after use.

N-grams

The Partition command can be used to create n-grams. The command below tells Mathematica to give us all of the sublists of our word list that are two elements long and that are offset by one.

bigrams = Partition[lowerIntroList, 2, 1];
Short[bigrams, 8]
{{introduction,when},{when,on},{on,board},{board,h},{h,m},{m,s},{s,beagle},
{beagle,as},{as,naturalist},<>,{been,the},{the,main},{main,but},
{but,not},{not,exclusive},{exclusive,means},{means,of},{of,modification},
{modification,chapter}}

We can tally and sort bigrams, too.

sortedBigrams = Sort[Tally[bigrams], #1[[2]] > #2[[2]] &];
Short[sortedBigrams, 8]
{{{of,the},21},{{in,the},13},{{i,have},11},{{to,the},11},{{which,i},7},
{{to,me},7},{{of,species},6},{{i,shall},5},<>,{{beagle,as},1},
{{s,beagle},1},{{m,s},1},{{h,m},1},{{board,h},1},{{on,board},1},
{{when,on},1},{{introduction,when},1}}

Concordance (Keyword in Context)

A concordance shows keywords in the context of surrounding words. We can make one of these quite easily if we start by generating n-grams. Then we use Cases to pull out all of the 5-grams in the Introduction that have “organic” as the middle word (for example), and format the output with the TableForm command.

fivegrams=Partition[lowerIntroList,5,1];
TableForm[Cases[fivegrams,{_,_,"organic",_,_}]]
affinities of    organic beings on
several distinct organic beings by
coadaptations of organic beings to
amongst all      organic beings throughout
succession of    organic beings throughout

Removing stop words

Mathematica has access to a lot of built-in, curated data. Here we grab a list of English stopwords.

stopWords = WordData[All, "Stopwords"];
Short[stopWords, 4]
{0,1,2,3,4,5,6,7,8,9,a,A,about,above,<<234>>,with,within,without,would,x,X,y,Y,yet,you,your,yours,z,Z}

The Select command allows us to use a function to pull items from a list. We want everything that is not a member of the list of stop words.

Short[lowerIntroList, 8]
{introduction,when,on,board,h,m,s,beagle,as,naturalist,i,was,much,struck,
with,certain,facts,in,the,<<1676>>,species,furthermore,i,am,convinced,
that,natural,selection,has,been,the,main,but,not,exclusive,means,of,
modification,chapter}
lowerIntroNoStopwords = 
  Select[lowerIntroList, Not[MemberQ[stopWords, #]] &];
Short[lowerIntroNoStopwords, 8]
{introduction,board,beagle,naturalist,struck,certain,facts,distribution,
inhabitants,south,america,geological,relations,present<<697>>,species,
descendants,species,furthermore,am,convinced,natural,selection,main,
exclusive,means,modification,chapter}

Bigrams containing the most frequent words

Here is a more complicated example built mostly from functions we’ve already seen. We start by finding the ten most frequently occurring words once we have gotten rid of stop words.

freqWordCounts = 
 Take[Sort[
   Tally[Take[
     lowerIntroNoStopwords, {1, -120}]], #1[[2]] > #2[[2]] &], 10]
{{shall,9},{species,9},{facts,9},{chapter,5},{variation,5},
{conditions,5},{beings,5},{organic,5},{conclusions,5},{subject,5}}

We remove a few of the words we are not interested in, then we rewrite the bigrams as a list of graph edges. This will be useful for visualizing the results as a network.

freqWords = 
  Complement[Map[First, freqWordCounts], {"shall", "subject"}];
edgeList = 
  Map[#[[1]] -> #[[2]] &, Partition[lowerIntroNoStopwords, 2, 1]];
Short[edgeList, 4]
{introduction->board,board->beagle,beagle->naturalist,<<717>>,
exclusive->means,means->modification,modification->chapter}

We grab the most frequent ones.

freqBigrams = Union[Select[edgeList, MemberQ[freqWords, #[[1]]] &],
   Select[edgeList, MemberQ[freqWords, #[[2]]] &]];
Short[freqBigrams, 4]
{abstract->variation,affinities->organic,allied->species,<<87>>,varieties->species,varying->conditions,volume->facts}

Finally we can visualize the results as a network. When you are exploring a text this way, you often want to keep tweaking your parameters to see if anything interesting comes up.

Framed[Pane[
  GraphPlot[freqBigrams, 
   Method -> {"SpringElectricalEmbedding", 
     "InferentialDistance" -> .1, "RepulsiveForcePower" -> -4}, 
   VertexLabeling -> True, DirectedEdges -> True, 
   ImageSize -> {1100, 800}], {400, 400}, Scrollbars -> True, 
  ScrollPosition -> {400, 200}]]

Document frequencies

We have been looking at the Introduction to Origin. We can also calculate word frequencies for the whole document. When we list the fifty most common words (not including stop words) we can get a better sense of what the whole book is about.

sampleList = 
  Map[ToLowerCase, StringSplit[sample, Except[WordCharacter] ..]];
docFreq = Sort[Tally[Sort[sampleList]], #1[[2]] > #2[[2]] &];
Take[Select[Take[docFreq, 200], 
  Not[MemberQ[stopWords, First[#]]] &], 50]
{{species,1489},{forms,397},{varieties,396},{selection,383},
{natural,361},{life,298},{plants,297},{different,282},{case,281},
{animals,280},{great,260},{distinct,255},{nature,253},{having,252},
{new,244},{long,243},{period,238},{cases,224},{believe,216},
{structure,214},{conditions,211},{genera,210},{generally,199},
{number,198},{common,194},{far,193},{time,191},{degree,190},
{groups,173},{characters,170},{certain,169},{view,168},{large,168},
{instance,165},{modification,161},{facts,157},{closely,155},
{parts,154},{intermediate,154},{modified,153},{genus,147},
{present,143},{birds,143},{produced,141},{individuals,140},
{inhabitants,139},{parent,138},{world,136},{character,136},
{organic,135}}

TF-IDF: Term frequency-Inverse document frequency

The basic intuition behind tf-idf is as follows:

  • A word that occurs frequently on every page doesn’t tell you anything special about that page. It is a stop word.
  • A word that occurs only a few times in the whole document or corpus can be ignored.
  • A word that occurs a number of times on one page but is relatively rare in the document or corpus overall can give you some idea what the page is about.

Here is one way to calculate tf-idf (there are lots of different versions):

tfidf[termfreq_,docfreq_,numdocs_]:=
     Log[termfreq+1.0] Log[numdocs/docfreq]
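
To get a feel for how the weighting behaves, we can evaluate the function on a few illustrative values (the numbers here are made up for demonstration):

```mathematica
tfidf[termfreq_, docfreq_, numdocs_] :=
  Log[termfreq + 1.0] Log[numdocs/docfreq]

(* a term appearing 10 times in a chapter, but in only 3 of 15 chapters *)
tfidf[10, 3, 15.0]
(* Log[11.] Log[5.], about 3.86 *)

(* a stop-word-like term appearing in every chapter scores zero,
   no matter how frequent it is, because Log[15/15] is 0 *)
tfidf[100, 15, 15.0]
```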

Using document frequencies and TF-IDF we can get a sense of what different parts of a text are about. Here is how we would analyze chapter 9 (there are 15 chapters in all).

ch9 = StringCases[sample, Shortest["CHAPTER 9" ~~ __ ~~ "CHAPTER"]][[
   1]];
ch9List = Map[ToLowerCase, StringSplit[ch9, Except[WordCharacter] ..]];
ch9Terms = Union[ch9List];
ch9TermFreq = Sort[Tally[ch9List], #1[[2]] > #2[[2]] &];
ch9DocFreq = Select[docFreq, MemberQ[ch9Terms, #[[1]]] &];
computeTFIDF[termlist_, tflist_, dflist_] :=
 Module[{outlist, tf, df},
  outlist = {};
  Do[
   tf = Cases[tflist, {t, x_} -> x][[1]];
   df = Cases[dflist, {t, x_} -> x][[1]];
   outlist = Append[outlist, {t, tf, df, tfidf[tf, df, 15.0]}],
   {t, termlist}];
  Return[outlist]]
ch9TFIDF = 
  Sort[computeTFIDF[ch9Terms, ch9TermFreq, 
    ch9DocFreq], #1[[4]] > #2[[4]] &];
Take[ch9TFIDF, 50][[All, 1]]
{teleostean,tapir,richest,pebbles,mississippi,downs,decay,conchologists,
wear,thinner,tear,supplement,superimposed,sedgwick,rolled,poorness,nodules,
mineralogical,levels,inadequate,grinding,gravel,downward,denuded,comprehend,
chthamalus,atom,accumulations,sand,ramsay,littoral,sedimentary,wears,wearing,
wealden,watermark,watch,vehemently,valve,upright,unimproved,unfathomable,
undermined,underlies,unanimously,ubiquitous,transmutation,tides,tidal,swarmed}

Whether or not you are familiar with nineteenth-century science, it should be clear that the chapter has something to do with geology. Darwin also provided chapter summaries of his own:

StringTake[ch9, 548]
CHAPTER 9. ON THE IMPERFECTION OF THE GEOLOGICAL RECORD. On the absence 
of intermediate varieties at the present day. On the nature of extinct 
intermediate varieties; on their number. On the vast lapse of time, as 
inferred from the rate of deposition and of denudation. On the poorness 
of our palaeontological collections. On the intermittence of geological 
formations. On the absence of intermediate varieties in any one formation. 
On the sudden appearance of groups of species. On their sudden appearance 
in the lowest known fossiliferous strata.

In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.

My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurrence function.) As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising, I decided to share it so that other people might get some use out of it (or suggest improvements).

In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.
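
A minimal sketch of this comparison might look like the following, assuming the page images and ALTO files sit in parallel directories with matching base names (the directory layout and file names here are hypothetical):

```mathematica
(* hypothetical layout: page images in book/images, OCR in book/alto *)
jpegSizes = Map[FileByteCount, Sort[FileNames["*.jpg", "book/images"]]];
altoSizes = Map[FileByteCount, Sort[FileNames["*.xml", "book/alto"]]];

(* one point per page; text pages, image pages and blank pages
   should fall into different regions of the plot *)
ListPlot[Transpose[{jpegSizes, altoSizes}],
 AxesLabel -> {"JPEG bytes", "ALTO bytes"}]
```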

The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster, pages with images tend to cluster, and blank pages (and covers) fall out along one axis because they have no text at all. We can use more sophisticated image processing and machine learning to further subdivide images and extract them once they are located, but this seems pretty good for a first pass.

In my previous post, I showed how to connect an Arduino microcontroller to Mathematica on Mac OS X using the SerialIO package.  It is also quite straightforward to interact with Phidgets.  In this case we can take advantage of Mathematica’s J/Link Java interface to call the Phidgets API.  This is basically a ‘hello world’ demonstration.  For a real application you would include error handling, event driven routines, and so on.  For more details, read the Java getting started tutorial and Phidgets programming manual, then look at the sample code and javadocs on this page.

Start by installing the Mac OS X Phidgets driver on your system. Once you have run Phidgets.mpkg you can open System Preferences and there will be a pane for Phidgets.  For my test, I used a PhidgetInterfaceKit 8/8/8 with an LED on Output 2 and a 60mm slider (potentiometer) attached to Sensor 0. Once you have the hardware configuration you like, plug the InterfaceKit into the USB. It should show up in the General tab of system preferences. If you double click on the entry, it will start a demonstration program that allows you to make sure you can toggle the LED and get values back from the slider. When everything is working correctly, you can close the program and open Mathematica.

In a Mathematica notebook, you are going to load the J/Link package, install Java, and put the phidget21.jar file on your class path by editing the AddToClassPath[] command in the snippet below.

Needs["JLink`"]
InstallJava[]
AddToClassPath["/path/to/phidget21"]
phidgetsClass = LoadJavaClass["com.phidgets.InterfaceKitPhidget"]

Next, create a new instance of the InterfaceKit object, open it and wait for attachment. You can include a timeout value if you’d like. Once the InterfaceKit is attached, you can query it for basic information like device name, serial number and sensor and IO options.

ik = JavaNew[phidgetsClass]
ik@openAny[]
ik@waitForAttachment[]
ik@getSerialNumber[]
ik@getDeviceName[]
{ik@getOutputCount[], ik@getInputCount[], ik@getSensorCount[]}

Finally you can use Mathematica’s Dynamic[] functionality to create a virtual slider in the notebook that will waggle back and forth as you move the physical slider attached to the InterfaceKit. You can also turn the LED on and off by clicking a dynamic checkbox in the notebook.

Slider[
 Dynamic[
  Refresh[ik@getSensorValue[0], UpdateInterval -> 0.1]], {0, 1000}]

bool=false;
Dynamic[ik@setOutputState[2, bool]]
Checkbox[Dynamic[bool]]

When you are finished experimenting, close the InterfaceKit object.

ik@close[]

I’ve been programming regularly in Mathematica for more than a year, using the language mostly for spidering, text mining and machine learning applications. But now that I am teaching my interactive exhibit design course again, I’ve started thinking about using Mathematica for physical computing and desktop fabrication tasks. First on my to do list was to find a way to send and receive data from the Arduino. A quick web search turned up the work of Keshav Saharia, who is close to releasing a package called ArduinoLink that will make this easy. In the meantime, Keshav helped me to debug a simple demonstration that uses the SerialIO package created by Rob Raguet-Schofield. There were a few hidden gotchas involved in getting this working on Mac OS X, so I thought I would share the process with others who may be interested in doing something similar.

On the Arduino side, I attached a potentiometer to Analog 1, and then wrote a simple program that waits for a signal from the computer, reads the sensor and then sends the value back on the serial port.  It is based on the Serial Call and Response tutorial on the Arduino website.

/*
 arduino_mathematica_example

 This code is adapted from
 http://arduino.cc/en/Tutorial/SerialCallResponse

 When started, the Arduino sends an ASCII A on the serial port until
 it receives a signal from the computer. It then reads Analog 1,
 sends a single byte on the serial port and waits for another signal
 from the computer.

 Test it with a potentiometer on A1.
 */

int sensor = 0;
int inByte = 0;

void setup() {
  Serial.begin(9600);
  establishContact();
}

void loop() {
  if (Serial.available() > 0) {
    inByte = Serial.read();
    // divide sensor value by 4 to return a single byte 0-255
    sensor = analogRead(A1)/4;
    delay(15);
    Serial.write(sensor);
  }
}

void establishContact() {
  while (Serial.available() <= 0) {
    Serial.print('A');
    delay(100);
  }
}

Once the sketch is installed on the Arduino, close the Arduino IDE (otherwise the device will look busy when you try to interact with it from Mathematica).  On the computer side, you have to install the SerialIO package in

/Users/username/Library/Mathematica/Applications

and make sure that it is in your path.  If the following command does not evaluate to True

MemberQ[$Path, "/Users/username/Library/Mathematica/Applications"]

then you need to run this command

AppendTo[$Path, "/Users/username/Library/Mathematica/Applications"]

Next, edit the file

/Users/username/Library/Mathematica/Applications/SerialIO/Kernal/init.m

so the line

$Link = Install["SerialIO"]

reads

$Link =
Install["/Users/username/Library/Mathematica/Applications/SerialIO/MacOSX/SerialIO",
LinkProtocol -> "Pipes"]

If you need to find the port name for your Arduino, you can open a terminal and type

ls /dev/tty.*

The demonstration program is shown below.  You can download both the Arduino / Wiring sketch and the Mathematica notebook from my GitHub repository.  You need to change the name of the serial device to whatever it is on your own machine.

<<SerialIO`

myArduino = SerialOpen["/dev/tty.usbmodem3a21"]

SerialSetOptions[myArduino, "BaudRate" -> 9600]

SerialReadyQ[myArduino]

Slider[
 Dynamic[Refresh[SerialWrite[myArduino, "B"];
  First[SerialRead[myArduino] // ToCharacterCode],
  UpdateInterval -> 0.1]], {0, 255}]

The Mathematica code loads the SerialIO package, sets the rate of the serial connection to 9600 baud to match the Arduino, and then polls the Arduino ten times per second to get the state of the potentiometer.  It doesn’t matter what character we send the Arduino (here we use an ASCII B).  We need to use ToCharacterCode[] to convert the response to an integer between 0 and 255.  If everything worked correctly, you should see the slider wiggle back and forth in Mathematica as you turn the potentiometer.  When you are finished experimenting, you need to close the serial link to the Arduino with

SerialClose[myArduino]

In April 2008, I posted an article in my blog Digital History Hacks about visualizing the social network of NiCHE: Network in Canadian History & Environment as it was forming. We now use custom programs written in Mathematica to explore and visualize the activities of NiCHE members, and to assess our online communication strategies. Some of the data comes from our online directory, where members can contribute information about their research interests and activities. Some of it comes from our website server logs, and some of it is scraped from social networking sites like Twitter. A handful of examples are presented here, but the possibilities for this kind of analysis are nearly unbounded.

Clusters of Research Interest

When NiCHE members add their information to our online directory, they are encouraged to select one or more of a set of research interests. This heat map shows the degree of overlap for each pair of topics, with brighter colours indicating a greater number of people who indicated an interest in both areas. Some of the pairings are not surprising: people who are interested in landscape are often interested in conservation, environmentalism and parks, and vice versa. The absence of overlap is also meaningful. People who are interested in fisheries seem not to be interested in landscape, and vice versa. Why not? A workshop that tried to bring both groups together to search for common ground might lead to new insights. Studying visualizations like this one also allows us to assess the extent to which our original thematic projects (focusing on Landscapes, Forest History, Water, etc.) actually cover the interests of members. Some of the members of the NiCHE executive do research in the history and philosophy of science, and this is apparently something that many NiCHE members are also interested in. A future workshop to address this interest might be co-hosted by NiCHE and the Situating Science knowledge cluster.

Research Interest Overlaps

Looking at the same information in a different way brings new things to light. This figure shows the degree of overlap between pairs of research interests as a graph rather than a heat map. Research topics are represented as vertices, and the size of the edge connecting each pair indicates the degree of overlap. This graph suggests that NiCHE members who are interested in subjects that focus on material evidence over very long temporal durations are relatively marginal in the knowledge cluster, and may not be well connected even with one another. Again, being able to visualize the data gives us the possibility of addressing the situation. Perhaps we should do more outreach to geologists and archaeologists?

Bridging Capital

Studies of social networks suggest that their “small world” properties are typically due to people who provide bridges between interest groups or make other kinds of long-distance connections. Here we use a graph to visualize every pair of research topics that are of interest to a single NiCHE member. From this figure it is easy to see that Darin Kinsey is the only person who has claimed to be interested in both landscapes and fisheries. If we did decide to hold a workshop on the intersection of those two topics, he might be the ideal person to help organize it. If we want to try to get scholars talking to one another across methodological or thematic boundaries, then we should enlist the help of people like Ravi Ranganathan, Norm Catto and Liza Piper to get the conversation started.

Centrality of NiCHE Activities

Our online directory also allows NiCHE members to indicate which activities they have participated in. The vertices in this graph represent NiCHE members and activities that we have sponsored. If a member participated in a particular activity, there is an edge connecting the two vertices. The color of each vertex represents a measure of network centrality. Here we have labeled the most central of our activities. Note that the conferences (especially Confluences 2007 and Climate History 2008) drew relatively large numbers of participants who did not attend other NiCHE activities. Participants in the summer field schools (CHESS), on the other hand, were much more likely to attend more than one of our activities. This suggests that our field schools do a better job of helping to constitute NiCHE as an ongoing entity than regular conferences would. This is consistent with reports that we have received from regular CHESS participants, especially new scholars.

NiCHE Twitter Followers

In April 2011, NiCHE had about 340 followers on the social networking site Twitter. Each vertex in this graph represents one Twitter user. The size of the icon is scaled according to the log of the number of followers that each user has. (In this case, the number of followers ranges from a handful to tens of thousands, depending on the user.) The edges of the graph represent some of the connections between Twitter users who follow one another. This figure shows that the NiCHE Twitter audience includes a relatively dense network of scholars who identify themselves either as digital humanists or as Canadian / environmental historians or geographers. There is also a relatively large collection of followers who do not appear to have many connections with one another. Knowing something about who is reading our tweets enables us to gauge the degree to which our online knowledge mobilization activities are effective, and helps us think about targeting our messages to particular communities.

For physical computing or amateur science projects, you often need to be able to get the output of a sensor or transducer into your computer.  There are a lot of ways to do this with specialized data acquisition hardware and software.  This method is pretty old school, and requires only a standard laptop and a handful of inexpensive electronic components.  It may be appropriate if you are working in a relatively quiet environment and you don’t need to take a lot of samples per second.

Just about any physical signal or measurement in the world can be converted into a fluctuating voltage, an analog signal.  Most laptops have a built-in microphone, so if you can convert your voltage into an audible signal, you can use the microphone to digitize it.  Once it is in digital form, you can then process the signal with any programming language.  Here we’ll use Mathematica.

The electronic side of the project is a very simple metronome (or tone generator) based on the 555 timer.  It is a mashup of two circuits from Forrest Mims’ work, the V/F Converter on page 51 of his Science and Communication Circuits and Projects and the Audio Oscillator / Metronome on p. 22 of his Timer, Op Amp and Optoelectronic Circuits and Projects.

The changing voltage from a sensor or transducer is fed into pin 5 on the timer.  For voltages between about 1.25v and 4.7v, the timer will respond with chirps or clicks that are more-or-less frequent: the frequency depends on the voltage.  With lower voltages, the frequency is higher, and with higher voltages it is lower.  The variable resistor can be used to tune the center of the frequency range.  Increasing the value of the capacitor decreases the frequency and makes each individual chirp longer.  Decreasing the value of the capacitor increases the frequency and makes each chirp shorter.  For the test here, the variable resistor was set to 100K (as measured with a digital multimeter).  The piezo element was rated for 3-20v, 10mA.  You can make the circuit portable by powering it with a 9v battery, but you might have to adjust the values of the resistors and capacitor for best results.  I used a bench power supply and proto-board for my experiments.
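As a rough sanity check on the timing components, the standard astable-mode formula for the 555 gives the nominal click frequency when no control voltage is applied: f ≈ 1.44 / ((R1 + 2·R2)·C). The component values below are illustrative placeholders, not the exact parts from the circuit diagram.

nominalFrequency[r1_, r2_, c_] := 1.44/((r1 + 2*r2)*c)
nominalFrequency[10.*^3, 100.*^3, 1.*^-6]

With a hypothetical 10K for R1, the 100K variable resistor as R2 and a 1 μF capacitor, this works out to roughly 6.9 Hz; the voltage fed into pin 5 then shifts the actual frequency up or down around that nominal value.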

To make a sample signal, I connected the signal line to a power supply set at 4.7v, decreased it by hand to 1.25v, then increased it back to 4.7v.  The circuit responded by producing slower, then faster, then slower clicks.  If you would like to listen to the recording, there are MP3 and WAV files in the GitHub repository.

My original plan was to use the SystemDialogInput["RecordSound"] command to record the audio directly into Mathematica, but that feature is unfortunately not supported on Macs yet.  So I recorded the sample using Audacity instead, and then used the Import command to load the sound file into Mathematica.

My complete Mathematica notebook is available for download on GitHub.  Here are the highlights.

We load the signal in as data, and plot every thousandth point to get an idea of what it looks like.  Since we sampled the sound at 44100 Hz, there will be 44100 data points for each second of audio, or 44.1 data points per millisecond.  The duration of each click is on the order of 10 milliseconds or less, so we won’t be able to make out individual clicks this way.

signalData = Flatten@Import[path <> "sample-data.wav", "Data"];
ListLinePlot[
   signalData[[1 ;; Length[signalData] ;; 1000]],
   ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
   Joined -> False, Filling -> Axis]

Near the beginning of the sample, the clicks are relatively sparse.  If we start at the 3 second mark and plot every point for 50 milliseconds of our data, we can see one click.

second = 44100;
msec = N[second/1000];
sparseRegion = signalData[[(3*second) ;; (3*second) + Round[50*msec]]];
ListLinePlot[sparseRegion,
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]

We are interested in pulling out the clicks, so we take the absolute value of our data, then use the MovingAverage command to smooth it a bit.  Averaging over ten data points seems to give us a sharp peak.

ListLinePlot[MovingAverage[Abs[sparseRegion], 10],
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]

Now we can use a thresholding function that returns 1 when the signal is above a given level and 0 otherwise. We use Map to apply the function to each data point.

thresholdFunction[value_, threshold_] :=
    If[value > threshold, 1, 0]
threshold = 0.25;
ListLinePlot[
    Map[
        thresholdFunction[#, threshold] &, 
        MovingAverage[Abs[sparseRegion], 10]], 
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All, 
    Joined -> False, Filling -> Axis]

This looks pretty good when the clicks are infrequent. We need to make sure that it will work in a region of our signal where clicks are more frequent, too. We look at a 50-millisecond-long region 10 seconds into our signal, doing the same processing and plotting the results.

denseRegion = signalData[[(10*second) ;; (10*second) + Round[50*msec]]];
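The plotting step itself is presumably the same as before; a minimal version for the dense region, reusing the thresholdFunction and threshold defined above, would be:

ListLinePlot[
    Map[thresholdFunction[#, threshold] &,
        MovingAverage[Abs[denseRegion], 10]],
    ImageSize -> Full, AspectRatio -> 0.25, PlotRange -> All,
    Joined -> False, Filling -> Axis]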

Since our thresholding looks like it will work nicely for our data, we can apply it to the whole dataset.

signalThresholded = 
    Map[thresholdFunction[#, threshold] &, 
        MovingAverage[Abs[signalData], 10]];

Now we need to count the number of clicks that occur in a given interval of time. We will do this by using the Partition command to partition our data into short, overlapping windows. Each window will be 200 milliseconds long, and will overlap those on its left and right by 100 milliseconds each. To count clicks, we will use the Split command to split each window into runs of identical values (i.e., all 1s or all 0s), then Map the Total function over each of the resulting lists. If there are more than, say, five 1s in a row, surrounded on either side by a run of 0s, we can assume that a click has occurred.

signalThresholdedPartition = 
    Partition[signalThresholded, Round[200*msec], Round[100*msec]];
countClicks[window_] :=
    Total[
        Map[If[# > 5, 1, 0] &, 
            Map[Total, Split[window]]]]

We can now plot the number of clicks per window. This shows us that the voltage starts high, drops, then increases again. Note the spike in the middle, which represents a dense cluster of clicks. Looking at the raw signal, this appears to have been a burst of noise. At this point, we have a workable way to convert a varying voltage into a digital signal that we can analyze with Mathematica. For most applications, the next step would be to calibrate the system by recording audio samples at specific voltages and determining how many clicks per window were associated with each voltage.

ListLinePlot[Map[countClicks, signalThresholdedPartition]]
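To sketch what that calibration step might look like: record a sample at each of several known voltages, count the clicks per window as above, and build an interpolating function that maps click counts back to voltages. The voltage/click-count pairs below are hypothetical placeholders, not measured values.

(* Hypothetical {clicks per window, voltage} pairs; replace with measurements. *)
calibrationData = {{2, 4.7}, {5, 3.5}, {9, 2.5}, {14, 1.25}};
clicksToVoltage = Interpolation[calibrationData, InterpolationOrder -> 1];
clicksToVoltage[7]  (* estimated voltage for a window containing 7 clicks *)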