This is an open access, open content and open source textbook in the form of a Mathematica notebook. If you do not have Mathematica, you can download the .CDF version of the text and open it with Wolfram’s free CDF Player software. You will not be able to do all of the computations that you can with the notebook version, but it should be good enough for you to get an idea of what is included.
The first edition (v1.0, August 2015) contains six complete chapters which cover the following topics.
- Analyzing Text: word frequencies, word clouds, characterizing sentences, text search, bag of words representation, keyword in context.
- Pattern Matching: string patterns, computable word data, stemming, concordance, capitalized words and phrases, n-gram analysis, stop words.
- Who and What: computable data about people, associations, named entities.
- When and Where: computable data about events and geospatial entities, timelines, maps, collocations, visualizing co-occurrence, vector distance and similarity.
- Information Retrieval: document vector model, related records, TF-IDF, document frequencies, summarization, computable subject data.
- Internet Sources: batch downloading, comparing texts, indexing for search, markup languages, scraping, interactive pattern matching, RSS feeds.
In addition, there are code samples (but no explanatory text) for the following tasks. I hope to expand these to full chapters in future editions.
- Image Processing: PDFs, optical character recognition, visualizing page images, automatic image extraction, detecting faces, photogrammetry, georectification, image classification and identification.
- Spidering and APIs: Wikipedia, network graphing, clustering, Internet Archive, OCLC WorldCat Identities, Open Library API, JSTOR Data for Research.
A set of accompanying slides will be posted weekly from September through early December 2015 at
These cover many of the techniques from the textbook in slightly simplified form, focussing more on the use of the techniques than the underlying code.