When I was writing my dissertation in 2003, Library and Archives Canada put the Dictionary of Canadian Biography online, and made it freely available to scholars. Since I was in the US at the time, this made my life much easier. Instead of taking the subway to a library that had a copy of the DCB every time I needed to look someone up, I could now get biographical information without interrupting my writing.

The online DCB has many other advantages over the print edition, however. For one thing, the entire text can be searched for keywords. If you are interested in a relatively obscure place that may no longer exist, you can immediately find the biographies that mention that place. If you search for “Fort Chilcotin,” for example, you will only find one match, “Klatsassin.” Most keywords that appear very infrequently will not make it into a printed index, making them almost impossible to find without full-text searching.

Another advantage of online information is that it can often be made even more useful with a little bit of web programming. (For more on this idea see “Teaching Young Historians to Search, Spider and Scrape.”) Thus the first of our digital history hacks.

On the advanced search page of the online DCB, it is possible to click on a volume number, geographical region, gender, or “identification” to see how many biographies match that category. Doing this shows that there are, for example, 450 biographies of females and 7,548 biographies of males. It is also possible to combine categories. There are 15 biographies of female aboriginal people and 229 biographies of male aboriginal people. Exploring the search page in such a desultory fashion can tell you a lot about Canadian historiography. Wouldn’t it be nice to be able to automate this exploratory process?

This hack scrapes the search page to extract the codes for each of the identification categories, then ‘clicks’ each category and grabs the number of matching biographies. The results are then presented as a “tag cloud,” a representation where the font size is proportional to the number of hits. The code for the hack was written in Perl and is available on GitHub.
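To make the sizing rule concrete, here is a minimal sketch of one way to turn hit counts into font sizes, assuming a linear scale between a minimum and a maximum size. The `font_size` helper and the sample counts are invented for illustration and are not DCB figures; the 12 and 36 bounds simply mirror the range the script uses.

```perl
use strict;
use warnings;
use POSIX "floor";

# Hypothetical helper: map a hit count onto a font size between
# $min and $max, scaled linearly against the largest count.
sub font_size {
    my ($count, $maxcount, $min, $max) = @_;
    return $min if $maxcount <= 0;    # avoid dividing by zero
    return $min + floor(($max - $min) * ($count / $maxcount));
}

# Invented sample counts, not real DCB figures.
my %counts = (alpha => 100, beta => 50, gamma => 0);

# Prints alpha: 36, beta: 24, gamma: 12 (one per line).
foreach my $name (sort keys %counts) {
    printf "%s: %d\n", $name, font_size($counts{$name}, 100, 12, 36);
}
```

The largest category always gets the maximum size and an empty category the minimum, so a few very large categories will dominate the cloud.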

# dcbo-ids-cloud.pl
# 15 jan 2006
#
# wj turkel
# http://digitalhistoryhacks.blogspot.com
#
# Goes to the online Dictionary of Canadian Biography to get the
# number of people in each category ('Aboriginal', 'Accountant', etc.)
# Outputs a tag cloud as HTML.
#
# LWP code adapted from
#   Burke, Perl & LWP (O'Reilly 2002), pp. 27-28, 96-97.
# Tag cloud adapted from
#   Bausch, Yahoo! Hacks (O'Reilly 2006), pp. 203-04.
# Max subroutine from
#   Schwartz & Phoenix, Learning Perl, 3rd ed (O'Reilly 2001), p. 65.

use strict;
use warnings;
use LWP;
use LWP::Simple;
use POSIX "floor";

sub max {
    my($max_so_far) = shift @_;
    foreach(@_) {
        if ($_ > $max_so_far) {
            $max_so_far = $_;
        }
    }
    $max_so_far;
}

my $browser;
sub do_POST {
    $browser = LWP::UserAgent->new() unless $browser;
    my $resp = $browser->post(@_);
    return ($resp->content, $resp->status_line, $resp->is_success, $resp)
        if wantarray;
    return unless $resp->is_success;
    return $resp->content;
}

my $doc_url = 'http://www.biographi.ca/EN/Search.asp';

# Create a hash of category IDs and names by scraping base page.
my %categories = ();
my $document = get($doc_url);
while ($document =~ m/<a class="NormalLink" href="Javascript:fSubmit\('','','','','([0-9]+)','1','','','','','',''\);">(.*?)<\/A>/g) {
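    my ($id, $tmp) = ($1, $2);
    # Strip any HTML tags embedded in the category name.
    # (The exact pattern used here is an assumption.)
    $tmp =~ s/<[^>]+>//g;
    $categories{$id} = $tmp;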
}

# We need to keep the category keys in the right order.
my @catarray = sort {$categories{$a} cmp $categories{$b}} (keys %categories);

# For each different category, return count of matching biographies.
my %categorycount = ();
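foreach my $id (@catarray) {
    my ($content, $message, $is_success) = do_POST(
        $doc_url,
        [ 'Data3' => $id,
          'Data4' => '1' ],
    );
    die "Error $message\n"
        unless $is_success;
    # Pull the hit count out of the results page. Only read $1 when the
    # match succeeds; otherwise it still holds an earlier capture.
    my $tmp = 0;
    if ($content =~ m{<strong>([0-9,]+)</strong> biography\(ies\) are available using your current search criteria}is) {
        $tmp = $1;
        # Remove every thousands separator, not just the first.
        $tmp =~ s/,//g;
    } else {
        warn "No count found for category $id\n";
    }
    $categorycount{$id} = $tmp;
    # Be considerate to their server.
    sleep 2;
}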

# Debugging scaffolding: check this output against tag cloud.
print "\n----------------------------\n";
foreach my $key (@catarray) {
 print "key: " . $key . "\tcat: " . $categories{$key} . "\tcnt: " . $categorycount{$key} . "\n";
}
print "\n----------------------------\n";

# Now we send the tag cloud to an HTML file.
open(OUTPUT, ">", "ID-cloud.html") or die "Cannot open ID-cloud.html: $!\n";

# Range of font sizes to use.
my $minfontsize = 12;
my $maxfontsize = 36;

# Get the maximum number of biographies in any category.
my $maxbio = &max(values %categorycount);

# Output the opening HTML tags.
# NOTE: the exact wrapper markup and title are assumptions; any minimal
# HTML page around the table will do.
print OUTPUT "<html>\n<head>\n<title>DCB tag cloud</title>\n</head>\n<body>\n";
print OUTPUT "<table width=\"80%\" border=\"1px\" cellpadding=\"4px\" align=\"center\">
<tbody>
<tr>
<td>
\n";

# Print the name of each category in the appropriately sized font.
# Font sizes are emitted in px here, which is an assumption.
foreach my $catid (@catarray) {
    my $fontsize = $minfontsize + floor(($maxfontsize-$minfontsize) * ($categorycount{$catid}/$maxbio));
    print OUTPUT "<span class='tag' style='font-size: ${fontsize}px;'>" . $categories{$catid} . "</span>\n";
}

# Output the closing HTML tags.
print OUTPUT "</td>
</tr>
</tbody>
</table>
</body>
</html>\n";
close(OUTPUT);

print "Finished processing.\n";

The tag cloud of entries in the DCB looks like this:

Now what do we see? The vast majority of people in the DCB are businessmen, office holders, politicians, lawyers and soldiers. This, too, says a lot about Canadian historiography. It also suggests a new question: how do the categories change over time? That’s another hack for another day.

Tags: dictionary of canadian biography | digital history | hacking | perl | tag cloud

Revision History

  • 2006-01-15. Published at http://digitalhistoryhacks.blogspot.ca/2006/01/who-is-in-dictionary-of-canadian.html
  • 2008-09-26. Link to code updated to point to a wiki archive
  • 2012-04-17. Revised edition published on http://williamjturkel.net. Image of tag cloud moved to this blog. Blog-internal links revised to point to this site. Link to code updated to point to GitHub repository. Code also displayed inline.