This is an open access, open content and open source textbook in the form of a Mathematica notebook. If you do not have Mathematica, you can open the notebook with Wolfram’s free CDF Player software. The player does not let you do everything that the notebook version does, but it should be good enough to give you an idea of what is included.

Second Revised Edition (Summer 2020)


GitHub Repository
Download Mathematica notebook

The second revised edition (v2.01, September 2020) contains 23 complete chapters which cover the following topics. There are also a number of screencasts for each lesson.

Second Edition (Summer 2019)


GitHub Repository
Download Mathematica notebook
Download PDF version

The second edition (v2.0, August 2019) contains 22 complete chapters which cover the following topics.

  • Lesson 01. Reading Code. Word frequency, word clouds and stopwords.
  • Lesson 02. Computable Knowledge. Entities, tables, timelines and maps.
  • Lesson 03. Text Content. Mathematica notebooks and expressions, strings and natural language processing.
  • Lesson 04. Data Structures. Lists, associations and datasets.
  • Lesson 05. Reusing Code. Defining and developing functions, keyword in context (KWIC).
  • Lesson 06. Networks. Metadata, matrices and social network analysis.
  • Lesson 07. Indexing and Searching. Pattern matching, topic classification and term distribution.
  • Lesson 08. Geospatial Analysis. Geographic information: raster, vector and attribute data.
  • Lesson 09. Images. Computer vision, face detection, feature extraction and image mining.
  • Lesson 10. Page Images. Optical character recognition (OCR), figure extraction and classification.
  • Lesson 11. Crawling. Browser automation, batch downloading, web archives and WARC files.
  • Lesson 12. Linked Open Data. Resource description framework (RDF), SPARQL queries and endpoints, JSON-LD.
  • Lesson 13. Markup Languages. Scraping and parsing, XML, really simple syndication (RSS) and text encoding initiative (TEI).
  • Lesson 14. Studying Societies. Computational social science, search data, social media and social networks.
  • Lesson 15. Extracting Keywords. Information retrieval, term frequency-inverse document frequency (TF-IDF) and rapid automatic keyword extraction (RAKE).
  • Lesson 16. Word and Document Vectors. Feature extraction, dimension reduction, word embeddings and global vectors.
  • Lesson 17. Citations. References, web services, bibliographic linked open data and citation networks.
  • Lesson 18. Natural Language. Multilingual analysis, computational linguistics and sentiment analysis.
  • Lesson 19. Web Services. Entity networks, publication search, dashboards, manipulating JSON.
  • Lesson 20. Databases. Parts, selections and transformations, computations and querying, relations.
  • Lesson 21. Measuring Images. Photogrammetry, georectification, handwriting and facial 3D reconstruction.
  • Lesson 22. Machine Learning. Unsupervised clustering, classify, predict and transfer learning.

First Edition (Summer 2015)

Download Mathematica Notebook (.nb, 4.4MB)
Download CDF (.cdf, 14MB)
GitHub Repository

The first edition (v1.0, August 2015) contains six complete chapters which cover the following topics.

  • Analyzing Text: word frequencies, word clouds, characterizing sentences, text search, bag of words representation, keyword in context.
  • Pattern Matching: string patterns, computable word data, stemming, concordance, capitalized words and phrases, n-gram analysis, stop words.
  • Who and What: computable data about people, associations, named entities.
  • When and Where: computable data about events and geospatial entities, timelines, maps, collocations, visualizing cooccurrence, vector distance and similarity.
  • Information Retrieval: document vector model, related records, TF-IDF, document frequencies, summarization, computable subject data.
  • Internet Sources: batch downloading, comparing texts, indexing for search, markup languages, scraping, interactive pattern matching, RSS feeds.

In addition, there are code samples (but no explanatory text) for the following tasks. I hope to expand these to full chapters in future editions.

  • Image Processing: PDFs, optical character recognition, visualizing page images, automatic image extraction, detecting faces, photogrammetry, georectification, image classification and identification.
  • Spidering and APIs: Wikipedia, network graphing, clustering, Internet Archive, OCLC WorldCat Identities, Open Library API, JSTOR Data for Research.

A set of accompanying slides will be posted weekly from September through early December 2015 at

These cover many of the techniques from the textbook in slightly simplified form, focussing more on the use of the techniques than the underlying code.