In previous posts we downloaded a single book from the Internet Archive, calculated word frequencies, searched through it with regular expressions, and created a permuted term index. In this post, we extend our command line methods to include automatically downloading an arbitrarily large batch of files and building a simple search engine for our collection of sources.
In order to download a batch of files from the Internet Archive, we need a search term that will work on the advanced search page of the site. For the example here, I am going to be using the search
collection:gutenberg AND subject:"Natural history -- Juvenile literature"
There is a very nice post on the Internet Archive blog explaining the process in detail for a Mac or Windows machine. Here we will do everything at the Linux command line. For our trial we will only be using seven books, but the same method works just as well for hundreds or thousands of sources.
URL Encoding and the HTTP GET Method
First, a quick review of URL encoding and the HTTP GET method. Files on a web server are stored in a nested directory structure that is similar to the Linux / UNIX filesystem. To request a file, you have to give your web browser (or a program like wget) a URL, or uniform resource locator. Think of this like the address of a file. It starts with a message telling the server what protocol you want to use to communicate (e.g., HTTP). This is followed by the name of the host (typically a domain name like archive.org), an optional port number (which we don’t need to deal with here) and then the path to the resource you want. If the resource is a file sitting in a directory, the path will look a lot like a file path in Linux. For example,
For many sites, however, it is possible to send a custom query to the web server and receive some content as a result. One way of doing this is with the HTTP GET method. Your file path includes a query string like
and the server responds appropriately. (The exact query that you send will depend on the particular web server that you are contacting.)
Regardless of whether you are requesting a single file or sending an HTTP GET, there has to be a way of dealing with blank spaces, punctuation and other funky characters in the URL. This is handled by URL encoding, which converts the URL into a form which can be readily sent online.
Before it is URL encoded, the query string that we are going to send to the Internet Archive looks like this
collection:gutenberg AND subject:"Natural history -- Juvenile literature"
We will see one way to do this URL encoding below. In the meantime, if you would like to use a browser to see which files we are going to be batch downloading, you can see the search results here. Make sure to look at the URL in your browser’s address bar.
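To make the idea of URL encoding concrete, here is a minimal shell sketch of what an encoder does. It is only an illustration under stated assumptions: the function name urlencode_sketch is made up for this example, it handles plain ASCII input only, and it follows the common convention of turning spaces into plus signs and other special characters into a percent sign followed by two hexadecimal digits.

```shell
# Hypothetical helper (not part of any package): encode one ASCII string.
urlencode_sketch() {
    local rest=$1 out= tail char
    while [ -n "$rest" ]; do
        tail=${rest#?}            # everything after the first character
        char=${rest%"$tail"}      # the first character on its own
        rest=$tail
        case $char in
            [a-zA-Z0-9._~-]) out="$out$char" ;;   # safe characters pass through
            ' ')             out="$out+" ;;       # a space becomes a plus sign
            *)               out="$out$(printf '%%%02X' "'$char")" ;;  # others become %XX
        esac
    done
    printf '%s\n' "$out"
}

urlencode_sketch 'collection:gutenberg AND subject:"Natural history -- Juvenile literature"'
# prints collection%3Agutenberg+AND+subject%3A%22Natural+history+--+Juvenile+literature%22
```

The colons become %3A, the quotation marks become %22, and the hyphens are left alone because they are already safe in a URL.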
Using cat to Build a Query String
The cat command gives us one quick way to create small files at the Linux command line. We want a file called beginquery that contains the following text
To get that, we can enter the following command, type in the line of text that we want, press Return/Enter at the end of the line, then press control-d to signal the end of input
cat > beginquery
Now you can use
to make sure the file looks like it should. If you made a mistake, you can use sed to fix it, or delete the file with rm and start again. Using the same method, create a file called endquery which contains the following, and check your work.
We’ve created the beginning and end of a query string we are going to send to the Internet Archive using wget. We still need to URL encode the query itself, and then insert it into the string.
For URL encoding, we are going to use a slick method developed by Ruslan Spivak. We use the alias command to create a URL encoder with one line of Python code. (I won’t explain how this works here, but if you would like to learn more about Python programming for humanists, there are introductory lessons at the Programming Historian website.)
At the command line you can enter the following, then type alias to check your work.
alias urlencode='python3 -c "import sys, urllib.parse as ul; print(ul.quote_plus(sys.argv[1]))"'
If you made a mistake, you can remove the alias with
unalias urlencode
and try again. If your urlencode alias is OK, you can now use it to create a query string for the IA.
urlencode 'collection:gutenberg AND subject:"Natural history -- Juvenile literature"' > querystring
We then use cat to put the three pieces of our query together, and check our work.
cat beginquery querystring endquery | tr -d '\n' | sed '$a\' > iaquery
Note that we had to remove the newlines from the individual pieces of our query, then add one newline at the end. This is what the tr and sed commands do in the pipeline above. Use cat to look at iaquery and make sure it looks OK.
If you want to download sources for a number of different Internet Archive searches, you just need to create new querystring and iaquery files for each.
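If the tr and sed steps in the pipeline above seem mysterious, the following sketch may help. It joins three throwaway files that stand in for beginquery, querystring and endquery. The function name join_pieces and the file contents are invented for the example, and the sed '$a\' idiom relies on GNU sed, which is what a typical Linux system provides.

```shell
# Hypothetical helper: join the given files into one line.
join_pieces() {
    # Concatenate the files, delete every newline, then add a single
    # newline at the very end (GNU sed only adds it if it is missing).
    cat "$@" | tr -d '\n' | sed '$a\'
}

demo_dir=$(mktemp -d)
printf 'BEGIN\n'  > "$demo_dir/part1"
printf 'MIDDLE\n' > "$demo_dir/part2"
printf 'END\n'    > "$demo_dir/part3"

join_pieces "$demo_dir/part1" "$demo_dir/part2" "$demo_dir/part3"
# prints BEGINMIDDLEEND on a single line
```

Without the tr step, the three pieces would end up on three separate lines, and wget would treat each line as a separate (and broken) URL.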
Downloading a Batch of Files with wget
When we use wget to send the query that we just constructed to the Internet Archive, their webserver will respond with a list of item identifiers. The -i option to wget tells it to use the query file that we just constructed, and the -O option tells it to put the output in a file called iafilelist. We then clean up that file list by deleting the first line and removing quotation marks, as follows
wget -i iaquery -O iafilelist
cat iafilelist | tr -d '[:punct:]' | sed '1d' > iafilelist-clean
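To see what the cleanup step does, here is a small sketch that runs it on a made-up list. The identifiers are invented, and the layout (a header line followed by one quoted identifier per line) is an assumption about what the server returns. The sketch also shows a narrower variant: deleting all punctuation removes hyphens, underscores and periods too, which could matter if an identifier contains them.

```shell
# Made-up sample data: a header line, then quoted identifiers.
sample_list() {
    printf '%s\n' '"identifier"' '"examplebook01"' '"another-example_02"'
}

# The cleanup used in this post: strip all punctuation, drop the first line.
clean_all_punct() { tr -d '[:punct:]' | sed '1d'; }

# A narrower variant (an alternative, not the post's method):
# strip only the double quotes, drop the first line.
clean_quotes_only() { tr -d '"' | sed '1d'; }

sample_list | clean_all_punct
# prints examplebook01 and anotherexample02

sample_list | clean_quotes_only
# prints examplebook01 and another-example_02
```

For the seven books used here the difference may not matter, but it is worth looking at iafilelist before and after cleaning to be sure the identifiers survived intact.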
We now have a list of files that we want to download from the Internet Archive. We will run wget again to get all of these files. Read the original post on the Internet Archive blog to learn more about the wget options being used. We create a directory to hold our downloads, then download the seven books. This only takes a few seconds.
mkdir download
cd download
wget -r -H -nc -nd -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ../iafilelist-clean -B 'http://archive.org/download/'
I’ve chosen to work with a small batch of texts here for convenience, but essentially the same techniques work with a huge batch, too. If you do download a lot of files at once, you will probably want to remove the -nd option, so wget puts each source in a directory of its own. It is also very important not to hose other people’s web servers. You can learn more about polite uses of wget at the Programming Historian website.
When we downloaded the books, we ended up with a lot of metadata files and alternate copies. We will save the ones that we want and delete the rest. Look through the downloaded files with
ls -l | less
then clean up the metadata files and the other versions of the text
mkdir metadata
mv ?????.txt_meta.txt metadata/
mv ?????-0.txt_meta.txt metadata/
ls metadata
There should be seven files in the download/metadata directory. Now you can get rid of the rest of the stuff in the download directory we won’t be using.
rm *meta.txt
rm pg*txt
rm *-8.txt
We are left with text versions of our seven books in the download directory and the metadata files for those versions in the download/metadata directory. Use less to explore the texts and their associated metadata files.
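Because rm with wildcards is unforgiving, it can be reassuring to preview what each pattern matches before deleting anything. The sketch below does this in a throwaway directory. The file names are invented to resemble the downloads, and preview_matches is a hypothetical helper written for this example. Recall that ? matches exactly one character and * matches any run of characters.

```shell
# Throwaway directory with made-up names shaped like the downloaded files.
demo_dir=$(mktemp -d)
touch "$demo_dir/12345.txt" "$demo_dir/12345.txt_meta.txt" \
      "$demo_dir/12345-8.txt" "$demo_dir/pg12345.txt" \
      "$demo_dir/67890-0.txt" "$demo_dir/67890-0.txt_meta.txt"

# Hypothetical helper: list the names in directory $1 that match the
# quoted patterns given as the remaining arguments, one name per line.
preview_matches() {
    local dir=$1
    shift
    ( cd "$dir" || exit 1
      for pattern in "$@"; do
          for name in $pattern; do          # unquoted so the shell expands it
              if [ -e "$name" ]; then       # skip patterns that matched nothing
                  printf '%s\n' "$name"
              fi
          done
      done )
}

preview_matches "$demo_dir" '?????.txt_meta.txt' '?????-0.txt_meta.txt'
# prints 12345.txt_meta.txt and 67890-0.txt_meta.txt

preview_matches "$demo_dir" 'pg*txt' '*-8.txt'
# prints pg12345.txt and 12345-8.txt
```

In a real download directory, running ls with the same pattern that you plan to give to rm serves the same purpose.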
Bursting Texts into Smaller Pieces
Note that we can use grep on multiple files. In the download directory, try executing the following command.
egrep "(tree|squirrel)" * | less
This will give us a list of all of the lines in our seven books where one or both of the search terms appears. It is not very useful, however, because we don’t have much sense of the larger context in which the term is situated, and we don’t know how relevant each instance is. Is it simply a passing mention of a tree or squirrel, or a passage that is about both? To answer that kind of query we will want to build a simple search engine for our collection of sources.
The first step is to burst each book into small pieces, each of which will fit on one screen of our terminal. We do this because it won’t do us much good to find out, say, that the book Friends in Feathers and Fur mentions squirrels and trees somewhere. We want to see the exact places where both are mentioned on a single page.
In Linux we can use the split command to burst a large file into a number of smaller ones. For example, the command below shows how we would split a file called filename.txt into pieces that are named filename.x.0000, filename.x.0001, and so on. The -d option tells split we want each file we create to be numbered, the -a 4 option tells it to use four digits, and the -l 20 option tells it to create files of (at most) twenty lines each. Don’t execute this command yet, however.
split -d -a 4 -l 20 filename.txt filename.x.
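If you would like to watch split at work without touching the books, here is a small sketch that runs the same options on a generated file in a temporary directory. The function name burst_demo and the file name sample.txt are made up for the example, and the -d option assumes GNU split, as found on Linux.

```shell
# Hypothetical helper: make a file of $1 numbered lines in a temporary
# directory, burst it into 20-line pieces, and list the pieces.
burst_demo() {
    local dir
    dir=$(mktemp -d)
    seq 1 "$1" > "$dir/sample.txt"
    ( cd "$dir" && split -d -a 4 -l 20 sample.txt sample.x. && ls sample.x.* )
}

burst_demo 45
# lists sample.x.0000, sample.x.0001 and sample.x.0002
```

A 45-line file yields two full pieces of twenty lines and a final piece holding the remaining five.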
Instead we start by creating a directory to store the burst copies of our books. We then copy the originals into that directory.
cd
mkdir burstdocs
cd burstdocs
cp ../download/*.txt .
ls
The shell should respond with
23367.txt 23941.txt 24993.txt 25548.txt 26331.txt 28077.txt 28299-0.txt
We need to burst all of our books into pieces, not just one of them. We could type seven split commands, but that would be laborious. Instead we will take advantage of the bash shell’s ability to automate repetitive tasks by using a for loop. Here is the whole command
for fileName in $(ls -1 *.txt) ; do split -d -a 4 -l 20 $fileName $fileName.x ; done
This command makes use of command substitution. It starts by executing the command
ls -1 *.txt
which creates a list of file names, one per line. The for loop then steps through this list and puts each file name into the fileName variable, one item at a time. Each time it is executed, the split command looks in fileName to figure out what file it is supposed to be processing. All of the files that split outputs are placed in the current directory, i.e., burstdocs.
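As an aside, the same loop can be written so that the shell expands the *.txt pattern itself, without calling ls. This is a sketch of a variant, not the command used in this post. The function name burst_all is made up, and quoting the variable means that file names containing spaces would also be handled.

```shell
# Hypothetical helper: burst every .txt file in directory $1 into
# pieces of at most $2 lines each.
burst_all() {
    ( cd "$1" || exit 1
      for fileName in *.txt; do
          [ -e "$fileName" ] || continue     # nothing matched the pattern
          split -d -a 4 -l "$2" "$fileName" "$fileName.x"
      done )
}

# Try it on two generated files in a temporary directory.
demo_dir=$(mktemp -d)
seq 1 45 > "$demo_dir/a.txt"
seq 1 10 > "$demo_dir/b.txt"
burst_all "$demo_dir" 20
ls "$demo_dir"
# lists a.txt, a.txt.x0000, a.txt.x0001, a.txt.x0002, b.txt and b.txt.x0000
```

The list of file names is worked out once, before the loop starts, so the pieces that split creates along the way are not themselves burst again.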
We don’t want to keep our original files in this directory after bursting them, so we delete them now.
rm ?????.txt ?????-0.txt
We can use ls to see that bursting our documents has resulted in a lot of small files. We want to rename them to get rid of the ‘txt.’ that occurs in the middle of each filename, and then we want to add a .txt extension to each.
rename 's/txt.//' *
rename 's/$/.txt/' *
We can use ls to see that each filename now looks something like 28299-0.x0645.txt. We can also count the number of files in the burstdocs directory with
ls -1 | wc -l
There should be 1889 of them. Use cd to return to your home directory.
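The rename command used above is the Perl-based one. If your system does not have it, much the same result can be had with a plain loop and mv. The following is a minimal sketch under that assumption. The function name rename_pieces is invented, and the demonstration uses empty stand-in files in a temporary directory rather than the real pieces.

```shell
# Hypothetical helper: in directory $1, turn names like 23367.txt.x0000
# into names like 23367.x0000.txt.
rename_pieces() {
    ( cd "$1" || exit 1
      for old in *.txt.x*; do
          [ -e "$old" ] || continue          # nothing matched the pattern
          base=${old%.txt.x*}                # part before ".txt.x"
          piece=${old##*.txt.x}              # digits after ".txt.x"
          mv -- "$old" "$base.x$piece.txt"
      done )
}

demo_dir=$(mktemp -d)
touch "$demo_dir/23367.txt.x0000" "$demo_dir/28299-0.txt.x0645"
rename_pieces "$demo_dir"
ls "$demo_dir"
# lists 23367.x0000.txt and 28299-0.x0645.txt
```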
Swish-e, the Simple Web Indexing System for Humans – Extended
To build a simple search engine, we are going to use the Swish-e package. This is not installed by default on Debian Linux, so you may need to install it yourself. Check to see with
man swish-e
If it is not installed, the shell will respond with “No manual entry for swish-e”. In this case, you can install the package with
sudo aptitude install swish-e
Try the man command again. You should now have a man page for swish-e.
The next step is to create a configuration file called swish.conf in your home directory. It should contain the following lines
IndexDir burstdocs/
IndexOnly .txt
IndexContents TXT* .txt
IndexFile ./burstdocs.index
Now we make the index with
swish-e -c swish.conf
Searching with Swish-e
If we want to search for a particular word–say ‘tree’–we can use the following command. The -f option tells swish-e which index we want to search in. The -m option says to return the ten most relevant results, and the -w option is our search keyword.
swish-e -f burstdocs.index -m 10 -w tree
The output consists of a list of files, sorted in decreasing order of relevance. We can use less to look at the first few hits and confirm that they do, in fact, have something to do with trees.
less burstdocs/23367.x0253.txt burstdocs/28077.x0119.txt burstdocs/23941.x0106.txt
When using less to look at a number of files like this, we can move back and forth with :n for next file, and :p for previous file. As always, q quits.
There are probably more than ten relevant hits in our set of books. We can see the next ten results with the option -b 11 which tells swish-e to begin with the eleventh hit.
swish-e -f burstdocs.index -m 10 -w tree -b 11
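The value to give -b for any later page of results follows a simple pattern: with ten results per page, page two starts at hit 11, page three at hit 21, and so on. Here is a tiny sketch of that arithmetic. The function name first_hit is made up for this example.

```shell
# Hypothetical helper: number of the first hit on page $1 when each
# page shows $2 results.
first_hit() {
    echo $(( ($1 - 1) * $2 + 1 ))
}

first_hit 1 10   # prints 1
first_hit 2 10   # prints 11
first_hit 3 10   # prints 21

# It could then be combined with the search command, for example:
# swish-e -f burstdocs.index -m 10 -w tree -b "$(first_hit 3 10)"
```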
The real advantage of using a search engine comes in finding documents that are relevant to more complex queries. For example, we could search for passages that were about both trees and squirrels with
swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w "tree AND squirrel"
The -H 0 option tells swish-e not to print a header before our results, and the -x option says that we only want to see matching filenames, one per line. We see that there are eight such passages.
It is a bit of a hassle to keep typing our long search command over and over, so we create an alias for that.
alias searchburst="swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w"
Now we can perform searches like
searchburst "tree AND squirrel"
searchburst "(tree AND squirrel) AND NOT flying"
searchburst "flying NEAR5 squirrel"
The last example returns results where ‘flying’ is within five words of ‘squirrel’.
A listing of relevant filenames is not very handy if we have to type each into the less command to check it out. Instead we can use the powerful xargs command to pipe our list of filenames into less.
searchburst "flying NEAR5 squirrel" | xargs less
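To see what xargs is doing here, consider the following sketch. It uses a stand-in for the search (a function that simply prints two file names, one per line) and cat in place of less, so that it can run without an index and without a keyboard. The names fake_search and demo_dir and the file contents are all invented for the example.

```shell
demo_dir=$(mktemp -d)
printf 'a passage about a flying squirrel\n' > "$demo_dir/a.txt"
printf 'a passage about an oak tree\n'       > "$demo_dir/b.txt"

# Stand-in for a search command: prints matching file names, one per line.
fake_search() {
    printf '%s\n' "$demo_dir/a.txt" "$demo_dir/b.txt"
}

# xargs reads the names from the pipe and appends them as arguments,
# so the command that actually runs is: cat FILE1 FILE2
fake_search | xargs cat
# prints the two passages, one after the other
```

Used this way, xargs splits its input on blank spaces as well as newlines. That is fine for the burst files, whose names contain no spaces, but it is something to keep in mind for other collections.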