Archives for category: Method

Text and Image Mining for Historical Research (Wolfram Virtual Technology Conference 2020)

2021/08/23 //

Scholars in history and other humanities tend to work differently than their colleagues in science and engineering. Working alone or occasionally in small groups, their focus is typically on the close reading of sources, including text and images. Mathematica’s high-level commands and holistic approach make it possible for a single programmer to implement relatively complex research tools that are customized for a particular study. Here William J Turkel describes a few examples, including the mining and conceptual analysis of text, and image mining applied to electronic schematics, photographs of bridges and the visual culture of stage magic. He also discuss the opportunities and challenges of teaching these kinds of skills to university students with little or no technical background.

Categories Mathematica, Method, Uncategorized

Close Reading: Historians, Detectives and Spies

2019/11/01 //

Here are some references and links for seminars that I conducted for my department’s “High School History Day” in November 2019.

The conceit of historians as detectives is very common in the field. By far my favorite exploration of historical detection is the essay “Clues” by Carlo Ginzburg, which appears in his collection Clues, Myths and the Historical Method (Baltimore: Johns Hopkins University Press, 1989). One book that many of us have on our shelves is Robin W. Winks, ed. The Historian as Detective: Essays on Evidence (New York: Harper & Row, 1969). Winks also authored a book on the relations between the intelligence community and the university, Cloak and Gown: Scholars and the Secret War, 1939-1961 (2nd ed, New Haven: Yale University Press, 1996). A more recent collection flips the premise, looking at what we can learn about the past by reading historical crime fiction: Ray B. Browne & Lawrence A. Kreiser, Jr., eds., The Detective as Historian: History and Art in Historical Crime Fiction (Madison: University of Wisconsin, 2000). Peirce’s three kinds of inference (and their connection to detective literature) are the subject of Umberto Eco & Thomas A. Sebeok, eds., The Sign of the Three: Dupin, Holmes, Peirce (Bloomington: Indiana University Press, 1988).

Determining when and where a picture was taken is one kind of verification task. The Bellingcat website has links to many resources, including daily quizzes. The two examples of Pence’s ‘historic journey’ on Twitter come from an article by Nicole Dahmen. The photo of the man in the pit of water comes from this New York Times article. Hany Farid has made a career of developing sophisticated techniques for authenticating digital images. His new text Photo Forensics (Cambridge, MA: MIT Press, 2016) is a wonderful resource. The Lee Harvey Oswald example is discussed in this news article.

The example of all of the things that one can infer about a society from a single coin was adapted from Louis Gottschalk, Understanding History: A Primer of Historical Method (New York: Alfred A. Knopf, 1950). The map of 19th century shipping comes from the work of digital historian Ben Schmidt on the US Maury collection of the government’s database of ship’s paths. The example of finding Paul Revere from metadata comes from a clever and accessible blog post by sociologist Kieran Healy.

John North’s incredibly detailed analysis of Hans Holbein’s painting The Ambassadors appears in The Ambassadors’ Secret: Holbein and the World of the Renaissance (New ed, London: Phoenix, 2004). The still undeciphered Voynich manuscript is in the Beinicke Rare Book and Manuscript Library at Yale University.

Categories Method

Some Computational History Links

2019/04/02 //

Here are some links for Spring 2019 talks on computational history that I gave at the Fields Institute and MIT.

Sites that can be used with no prior programming experience:

Gavagai Living Lexicon
IFTTT (If This Then That) for automating workflow
MemeTracker and NIFTY for visualizing the 24-hour news cycle
The Programming Historian for novice-friendly, peer-reviewed tutorials to get started with programming
Interactive TF-IDF at Wolfram Demonstations
Webrecorder.io to capture a website in a WARC file that can be browsed later
Wolfram Alpha for natural language queries of a computable knowledge database

If you are comfortable with scripting:

Build a Mini Search Engine with Apache Nutch and Solr
Build an Elasticsearch Search Engine for E-books in a Docker Container
CommonCrawl.org provides access to years of free web crawl data
YAGO provides structured access to ~120M facts concerning ~10M entities, derived from Wikipedia, WordNet and GeoNames

Technical sources:

Achlioptas, “Database-Friendly Random Projections“
Jurgens & Stevens, “Event Detection in Blogs using Temporal Random Indexing“
Kanhabua, Nguyen & Niederée, “What Triggers Human Remembering of Events”
Leskovec, Rajaraman & Ullman, Mining of Massive Datasets
Schmidt, “Stable Random Projection“

Historiography:

Brugger, The Archived Web (2019)
Hartog, Regimes of Historicity (2016)
Milligan, History in the Age of Abundance? (2019)
Snyder, The Road to Unfreedom (2018)
Tooze, Crashed (2018)

Categories Method

A Simple Algorithm for Finding Images in E-books

2012/11/09 //

In September, Tim Hitchcock and I had a chance to meet with Adam Farquhar at the British Library to talk about potential collaborative research projects. Adam suggested that we might do something with a collection of about 25,000 E-books. Although I haven’t had much time yet to work with the sources, one of the things that I am interested in is using techniques from image processing and computer vision to supplement text mining. As an initial project, I decided to see if I could find a way to automatically extract images from the collection.

My first thought was that I might be able to identify text based on its horizontal and vertical correlation. Parts of the image that were not text would then be whitespace, illustration or marginalia. (One way to do this in Mathematica is to use the ImageCooccurence function). As I was trying to figure out the details, however, I realized that a much simpler approach might work. Since the method seems promising I decided to share it so that other people might get some use out of it (or suggest improvements).

In a British Library E-book, each page has a JPEG page image and an associated ALTO (XML) file which contains the OCRed text. The basic idea is to compare the JPEG image file size with the ALTO file size for the same page. Pages that have a lot of text (and no images) should have large ALTO files relative to the size of the JPEG. Pages with an image but little or no text should have a large JPEG relative to the size of the ALTO file. Blank pages should have relatively small JPEG and ALTO files.

The graph below shows the results for an E-book chosen at random from the sample. Textual pages cluster, pages with images tend to cluster, and blank pages (and covers) fall out along one axis because they have no text at all. We can use more sophisticated image processing and machine learning to further subdivide images and extract them once they are located, but this seems pretty good for a first pass.

Categories Mathematica, Method

Measure and Refactor Constantly

2011/04/05 //

Let me wrap up this series of posts by suggesting that the most important aspect of the method is to practice what is called kaizen in Japanese: make continuous small improvements. Part of this is simply a willingness to keep tweaking something even while it is working. The other part is an ability to measure the effects of the changes that you do make. When I started looking at the word counts for my writing from day-to-day, I realized that the kind of music that I was listening to made a difference. So I started systematically buying songs at iTunes and seeing what kind of impact they had. When I found something that increased my productivity, I used the ‘Genius’ feature to find related songs and added them to my playlist. (iTunes: “Want to tell your friends about this? Connect with them on Ping.” Me: “Are you crazy? My friends would mock me until I cried. I can barely admit to myself that I spent 99 cents on that song.”) But if listening to terrible music allows me to write 14% more each day, I have one extra day each week to work on something else.

Programmers use the verb refactor to describe the process of taking apart something that is working, optimizing the pieces, and putting it back together. For a while I used DevonThink to organize all of my notes, but I realized that the features that made it work so well for a focused research project also made it too cumbersome to handle the minutiae of day-to-day life. So I’m now using OmniFocus (and Getting Things Done) to keep track of the things that I have to do, and using Evernote to collect random notes on everything that is not part of an existing research project or course that I’m teaching. ProfHacker has a number of good posts about things that you can do with Evernote (1, 2, 3). For me, the main advantages are that it is lightweight, cloud-based and accessible from almost every computing device I own.

Categories Method

Write and Cluster Small Texts

2011/04/04 //

So far we’ve discussed creating a backup system, a local archive of OCRed digital sources and a bibliographic database. We’ve also covered a number of strategies for adding new material to our local repository automatically, including search, information trapping, and spidering. Two programs lie at the core of the workflow, DevonThink Pro and Scrivener. When I add a new source to my local repository, I create a bibliographic record for it, then I index it in DevonThink. Once it is indexed I can use the searching, clustering and concordance tools in DevonThink to explore my sources. Since everything that is related to my project is indexed in DT, it is the one local place where I go to look for information. Rather than explain all of the useful things that can be done with your sources in DT (and there are a lot) I will simply refer to the articles by Steven Johnson and Chad Black that convinced me to give DT a try:

Steven Johnson, Tool for Thought and More Information
Steven Johnson, DIY: How to Write a Book
Chad Black, DevonThink and other Mac Apps for History and Humanities Research, DevonThink for Historical Research, Part II, DevonThink for Historical Research, Part III, and Update on the Ever-Changing Method

A couple of notes about DT. If you do decide to buy the software, buy the $10 e-book on how to use the software at the same time. There are a lot of powerful features and this is the fastest way to learn them. Unlike Johnson, I store everything related to my project in DT. That is why I advise bursting large documents into smaller units for better searching and clustering. As a historian, I also tend to write books that have more chronological structure than his do, so I use a sheet in DevonThink to create a chronology that can be sorted by date, name, event, and source. It is not as flexible as a spreadsheet, but it does not have to be.

For writing, I am using Scrivener, which allows me to draft and rearrange small units of text easily. I can copy a passage that I’ve just written from Scrivener into DT and use the magic hat to find anything in my sources that may be related to that paragraph (just as Johnson describes). The super-secret monograph consists of seven chapters of five subsections each. In Scrivener, I can see the status of each of those pieces at a glance: how much has been written, how much needs to be done, and how each relates to other subsections. Rather than facing a yawning gulf at the bottom of a word processer screen, and potential writer’s block, I can see the book take shape as I work on it. When the draft is finished in Scrivener, it is easily compiled to whatever form that the publisher wants. I can’t say enough good things about Scrivener. By itself I’m sure it has doubled my writing speed.

The key is working with small enough units of text. When you cite a source, you are typically referring to a part of it: a quote, a paragraph, a passage, an argument. Similarly, when you write a book, you can only work on a small part of it at one time. Inappropriate tools force you to manipulate objects at the wrong scale, usually the whole document, however long. Better tools, such as DT and Scrivener, allow you to focus on exactly the pieces you need for the task at hand.

Categories Method

Burst Documents for Fine-Grained Matching

2011/03/27 //

If you are collecting local copies of documents that you haven’t read yet, you need to use computer programs to search through and organize them. Since all text is OCRed, it is easy enough to use Spotlight to do the searching (or a program like DevonThink: more on this in a future post). The problem is that it does not do you much good to find out that your search term(s) appear in a book-length document. If you have a lot of reference books in your local digital archive, you will get many hits for common search terms. If you have to open each reference and look at each match, you are wasting your time. Admittedly, it is faster than consulting the index of a paper book and flipping to the relevant section. The process can be greatly sped up, however.

The trick is to take long PDFs and “burst” them into single pages. That way, when you search for terms in your local archive, your results are in the form

Big Reference Book, p 72
Interesting Article, p 7
Interesting Article, p 2
Another Article, p 49
Big Reference Book, p 70
Interesting Article, 1

rather than

Interesting Article (1, 2, 7)
Big Reference Book (70, 72)
Another Article (49)

The first advantage is that you don’t have to open each individual source (like Big Reference Book) and, in effect, repeat your search to find all instances of your search terms within the document. The second advantage is that the search algorithm determines the relevance of each particular page to your search terms, rather than the source as a whole. So the few valuable hits in a long, otherwise irrelevant document are surfaced in a convenient manner. Sources that are on-topic (like “Interesting Article” in the example above), will show up over and over in your results list. If you decide you want to read the whole source, you can open the original PDF rather than opening the individual page files.

It is easy enough to write a short script to burst PDFs if you know how to program. If you don’t, however, you can use an Automator workflow to accomplish the same results. If you’d like, you can even turn your bursting workflow into a droplet. I’ve attached a sample workflow to get you started.

Categories Method

Spider to Collect Sources

2011/03/22 //

Once you start collecting large numbers of digital sources by searching for them or using an information trapping strategy, you will find that you are often in the position of wanting to download a lot of files from a given site. Obviously you can click on the links one at a time to make local copies, but that loses one of the main advantages of working with computers–letting them do your work for you. Instead, you should use a program like DownThemAll (a Firefox extension), SiteSucker (a standalone program) or GNU wget (a command line tool that you call from the terminal). Each of these automates what would otherwise be a very repetitive and thankless task. If you have never tried this before, start by getting comfortable with one of the more user-friendly alternatives like DownThemAll or SiteSucker, then move to wget when you need more power. You can also write your own custom programs to harvest sources, of course. There is an introductory lesson on this in the Programming Historian.

Now imagine that you have a harvester that is not only capable of downloading files, but of extracting the hyperlinks found on each page, navigating to the new pages and downloading those also. This kind of program is called a spider, crawler or bot. Search engine companies make extensive use of web spiders to create a constantly updated (and constantly out-of-date), partial map of the entire internet. For research, it is really nice to be able to spider more limited regions of the web in search of quality sources. Again, it is not too difficult to write your own spiders (see Kevin Hemenway and Tara Calishain’s Spidering Hacks) but there are also off-the-shelf tools which will do some spidering for you. In addition to writing my own spiders, I’ve used a number of these packages. Here I will describe DevonAgent.

DevonAgent includes a web browser with some harvesting capabilities built in. You can describe what you are looking for with a rich set of operators like “(Industrial BEFORE Revolution) AND (India NEAR/10 Textile*)”. Results are displayed with both relevance ranking and with an interactive graphical topic map that you can use to navigate. You can keep your results in an internal archive or export them easily to DevonThink. (More on this in a future post). You can schedule searches to run automatically, thus extending your information trapping strategy. DevonAgent also has what are called “scanners”: filters that recognize particular kinds of information. You can search, for example, for pages that contain PDFs, e-mail addresses, audio files, spreadsheets or webcams. You can also specify which URLs your spider will visit (including password protected sites). DevonAgent comes with about 80 plugins for search engines and databases, including sites like Google Scholar, IngentaConnect, the Internet Archive and Project Gutenberg. You can also write your own plugins in XML.

DevonAgent allows you to follow links automatically and to set the depth of the search. If you were looking at this blog post with DevonAgent, a Level 1 search would also retrieve the pages linked to by this post (for DownThemAll, etc.), it would retrieve some other pages from my blog, and so on. A Level 2 search would retrieve all of the stuff that a Level 1 search gets, plus the pages that the DownThemAll page links to, the pages linked to in some of my other blog posts, and so on. Since a spider is a program that you run while you are doing something else, it is OK if it goes down a lot of blind alleys in order to find things that are relatively rare. Learning where to start and how to tune the depth of your search is an essential part of using spidering for your research. DevonAgent will use Growl to notify you when its search is complete. (If there is something that I am eagerly awaiting, I also use Prowl to get Growl notifications when I’m away from my computer. But you may find that’s too much of a good thing.)

Categories Method

Going Digital

2011/03/15 //

If you are just getting started with online research, there are some things that are handy to know, and a few tools you might like to set up for yourself.

Analog and digital. When I talk to my students about the difference between analog and digital representations, I use the example of two clocks. The first is the kind that has hour and minute hands, and perhaps one for seconds, too. At some point you learned how to tell time on an analog clock, and it may have seemed difficult. Since the clock takes on every value in between two times, telling time involves a process of measurement. You say, “it’s about 3:15,” but the time changes continuously as you do so. Telling time with a digital clock, by contrast, doesn’t require you to do more than read the value on the display. It is 3:15 until the clock says it is 3:16. Digital representations can only take on one of a limited (although perhaps very large) number of states. Not every digital representation is electronic. Writing is digital, too, in the sense that there are only a finite number of characters, and instances of each are usually interchangeable. You can print a book in a larger font, or in Braille, without changing the meaning. Salvador Dalí’s melting clocks, however, would keep different time–which was the point, of course.

The costs are different. Electronic digital information can be duplicated at near-zero cost, transmitted at the speed of light, stored in infinitesimally small volumes, and created, processed and consumed by machines. This means that ideas that were more-or-less serviceable in the world before networked computers–ideas about value, property rights, communication, creativity, intelligence, governance and many other aspects of society and culture–are now up for debate. The emergence of new rights regimes (such as open access, open content and open source) and the explosion of new information are manifestations of these changing costs.

You won’t be able to read everything. Estimates of the amount of new information that is now created annually are staggering (2003, 2009). As you become more skilled at finding online sources, you will discover that new material on your topic appears online much faster than you can read it. The longer you work on something, the more behind you will get. This is OK, because everyone faces this issue whether they realize it or not. In traditional scholarship, scarcity was the problem: travel to archives was expensive, access to elite libraries was gated, resources were difficult to find, and so on. In digital scholarship, abundance is the problem. What is worth your attention or your trust?

Assume that what you want is out there, and that you simply need to locate it. I first found this advice in Thomas Mann’s excellent Oxford Guide to Library Research. Although Mann’s book focuses primarily on pre-digital scholarship, his strategies for finding sources are more relevant than ever. Don’t assume that you are the best person for the job, either. Ask a librarian for help. You’ll find that they tend to be nicer, better informed, more helpful and more tech savvy than the people you usually talk to about your work. Librarians work constantly on your behalf to solve problems related to finding and accessing information.

The first online tool you should master is the search engine. The vast majority of people think that they can type a word or two into Google and choose something from the first page of millions of results. If they don’t see what they’re looking for, they try a different keyword or give up. When I talk to scholars who aren’t familiar with digital research, their first assumption is often that there aren’t any good online resources for their subject. A little bit of guided digging often shows them that this is far from the truth. So how do you use search engines more effectively? First of all, sites have an advanced search page that lets you focus in on your topic, exclude search terms, weight some terms more than others, limit your results to particular kinds of document, to particular sites, to date ranges, and so on. Second, different search engines introduce different kinds of bias by ranking results differently, so you get a better view when you routinely use more than one. Third, all of the major search engines keep introducing new features, so you have to keep learning new skills. Research technique is something you practice, not something you have.

Links are the currency of the web. Links make it possible to navigate from one context to another with a single click. For human users, this greatly lowers the transaction costs of comparing sources. The link is abstract enough to serve as means of navigation and able to subsume traditional scholarly activities like footnoting, citation, glossing and so on. Furthermore, extensive hyperlinking allows readers to follow nonlinear and branching paths through texts. What many people don’t realize is that links are constantly being navigated by a host of artificial users, colorfully known as spiders, bots or crawlers. A computer program downloads a webpage, extracts all of the links and other content on it, and follows each new link in turn, downloading the pages that it encounters along the way. This is where search engine results come from: the ceaseless activity of millions of automated computer programs that constantly remake a dynamic and incomplete map of the web. It has to be this way, because there is no central authority. Anyone can add stuff to the web or remove it without consulting anyone else.

The web is not structured like a ball of spaghetti. Research done with web spiders has shown that a lot of the most interesting information to be gleaned from digital sources lies in the hyperlinks leading into and out of various nodes, whether personal pages, documents, archives, institutions, or what have you. Search engines provide some rudimentary tools for mapping these connections, but much more can be learned with more specialized tools. Some of the most interesting structure is to be found in social networks, because…

Emphasis is shifting from a web of pages to a web of people. Sites like Blogger, WordPress, Twitter, Facebook, Flickr and YouTube put the emphasis on the contributions of individual people and their relationships to one another. Social searching and social recommendation tools allow you to find out what your friends or colleagues are reading, writing, thinking about, watching, or listening to. By sharing information that other people find useful, individuals develop reputation and change their own position in social networks. Some people are bridges between different communities, some are hubs in particular fields, and many are lonely little singletons with one friend. This is very different from the broadcast world, where there were a few huge hubs spewing to a thronging multitude (known as “the audience“).

Ready to jump in? Here are some things you might try.

Customize a browser to make it more useful for research

Install Firefox
Add some search extensions
Pull down the search icon and choose “Manage Search Engines” -> “Get more search engines” then search for add-ons within Search Tools
Try search refinement
1. Install the Deeper Web add-on and try using tag clouds to refine your search
2. Example. Suppose you are trying to find out more about a nineteenth-century missionary named Adam Elliot. If you try a basic Google search, you will soon discover there is an Australian animator with the same name. Try using Deeper Web to find pages on the missionary.
Add bookmarks for advanced search pages to the bookmark toolbar
1. Google Books (note that you can set to “full view only” or “full view and preview”
2. Historic News (experiment with the timeline view)
3. Internet Archive
4. Hathi Trust
5. Flickr Commons (try limiting by license)
6. Wolfram Alpha (try “China population”)
7. Google Ngram Viewer
You can block sites that have low quality results

Work with citations

Install Zotero
1. Try grabbing a record from the Amazon database
2. Use Zotero to make a note of stuff you find using Search Inside on your book
3. Under the info panel in Zotero, use the Locate button to find a copy in a library near you
4. From the WorldCat page, try looking for related works and saving a whole folder of records in Zotero
5. Find the book in a library and save the catalog information as a page snapshot
6. Go into Google Books, search for full view only, download metadata and attach PDF
Learn more about Zotero
Explore other options for bibliographic work (e.g. Mendeley)

Find repositories of digital sources on your topic

Here are a few examples for various kinds of historical research.

Canada
US
UK and Europe
Thematic (e.g., history of science and medicine)

Capture some RSS feeds

Install the Sage RSS feed reader in Firefox (lots of other possibilities, like Google Reader)
1. Go to H-Net and choose one or more lists to monitor; drag the RSS icons to the Sage panel and edit properties
2. Go to Google News and create a search that you’d like to monitor; drag RSS to Sage panel and edit properties
3. Go to Hot New Releases in Expedition and Discovery at Amazon and subscribe to RSS feed
Consider signing up for Twitter, following a lot of colleagues and aggregating their interests with Tweeted Times (this is how Digital Humanities Now works)

Discover some new tools

The best place to start is Lisa Spiro’s Getting Started in the Digital Humanities and Digital Research Tools wiki

Categories Method

Make Local Copies of Everything

2011/03/07 //

The digital world is a world of abundance. The most scarce resource in your research process is always going to be your own time. Ideally, then, you only want to pay attention to those things which absolutely require it–everything else should be handed off to computer programs. The digital world is also a plastic world, where anything can change without warning. Bookmarks are of little value, and the last thing you want to do is engage in a long and fruitless online search for something that you know you found once.

What this means in practical terms is that whenever you look at something, you want your computer to have a copy of what you’ve seen. If you need to look at a source again to cite it, or quote it, or reread it in the light of new information, that source should be instantly ready-to-hand. You can accomplish this by always saving a local, digital copy of everything you read, along with basic metadata: who created it? where did you find it? when did you look at it?

If you read something on paper, digitize and OCR it, then add the metadata to a bibliographic database. There are many options here. You can use Zotero in the Firefox browser on any platform. On the Mac, Sente, Bookends and Mendeley are all popular choices. (I’ve used all of these in my own research. Each has advantages and disadvantages, so you might have to try a few before settling on something that really works for you. The main point is that you need a bibliographic database, so choose one). If you look at something online, download a copy and OCR it if necessary. If you are browsing the web, take a screenshot of the information you want to keep, and OCR it. The Mac already has excellent built-in screen capturing abilities, but I also use LittleSnapper because it gives me more flexibility in what I save.

I like to keep all of my local copies of documents as PDFs because the format keeps the text and image layer of the document together. I use a consistent file naming convention so that I know what a source should be called if I already have a copy of it. Dates in filenames are always YYYYMMDD, so the files get sorted in chronological order when I look at the contents of a folder. I don’t use capital letters, spaces, or any punctuation other than dot, dash and underscore in my filenames. This makes it easier to write programs to process whole batches of files. (It is possible to write a program to parse any filename, of course, but who needs extra hassle?)

Categories Method

Text and Image Mining for Historical Research (Wolfram Virtual Technology Conference 2020)

Close Reading: Historians, Detectives and Spies

Some Computational History Links

A Simple Algorithm for Finding Images in E-books

Measure and Refactor Constantly

Write and Cluster Small Texts

Burst Documents for Fine-Grained Matching

Spider to Collect Sources

Going Digital

Make Local Copies of Everything

Projects

Recent

Archives

Categories