
Introduction

In previous posts we downloaded a single book from the Internet Archive, calculated word frequencies, searched through it with regular expressions, and created a permuted term index. In this post, we extend our command line methods to include automatically downloading an arbitrarily large batch of files and building a simple search engine for our collection of sources.

In order to download a batch of files from the Internet Archive, we need a search term that will work on the advanced search page of the site. For the example here, I am going to be using the search

collection:gutenberg AND subject:"Natural history -- Juvenile literature"

There is a very nice post on the Internet Archive blog explaining the process in detail for a Mac or Windows machine. Here we will do everything at the Linux command line. For our trial we will only be using a few books, but the same method works just as well for hundreds or thousands of sources.

[UPDATE 2014. Because the subject categories at the Internet Archive were changed for one of the books since this post was written, I have had to make a few minor edits below. These do not change the sense of the original lesson.]

URL Encoding and the HTTP GET Method

First, a quick review of URL encoding and the HTTP GET method. Files on a web server are stored in a nested directory structure that is similar to the Linux / UNIX filesystem. To request a file, you have to give your web browser (or a program like wget) a URL, or uniform resource locator. Think of this as the address of a file. It starts with the protocol you want to use to communicate with the server (e.g., HTTP). This is followed by the name of the host (typically a domain name like archive.org), an optional port number (which we don’t need to deal with here), and then the path to the resource you want. If the resource is a file sitting in a directory, the path will look a lot like a file path in Linux. For example,

https://github.com/williamjturkel/Digital-Research-Methods/blob/master/README.md
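
Broken into its parts, that example URL looks something like this (the port is implicit; 443 is the default for HTTPS):

protocol: https
host:     github.com
path:     /williamjturkel/Digital-Research-Methods/blob/master/README.md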

For many sites, however, it is possible to send a custom query to the web server and receive some content as a result. One way of doing this is with the HTTP GET method. Your file path includes a query string like

?lastname=Andrews&firstname=Jane

and the server responds appropriately. (The exact query that you send will depend on the particular web server that you are contacting.)
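
Put together, a URL that uses the GET method might look something like the following. This is just an illustration; the host and the parameter names are made up.

http://www.example.com/cgi-bin/lookup?lastname=Andrews&firstname=Jane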

Regardless of whether you are requesting a single file or sending an HTTP GET, there has to be a way of dealing with blank spaces, punctuation and other funky characters in the URL. This is handled by URL encoding, which converts the URL into a form which can be readily sent online.

When it is URL encoded, the query string that we are going to send to the Internet Archive

collection:gutenberg AND subject:"Natural history -- Juvenile literature"

becomes something like

collection%3Agutenberg%20AND%20subject%3A%22Natural+history+--+Juvenile+literature%22

or

collection%3Agutenberg+AND+subject%3A%22Natural+history+--+Juvenile+literature%22

We will see one way to do this URL encoding below. In the meantime, if you would like to use a browser to see which files we are going to be batch downloading, you can see the search results here. Make sure to look at the URL in your browser’s address bar.

Using cat to Build a Query String

The cat command gives us one quick way to create small files at the Linux command line. We want a file called beginquery that contains the following text

http://archive.org/advancedsearch.php?q=

To get that, we can enter the following command, type in the line of text that we want, press Return/Enter at the end of the line, then hit control-c

cat > beginquery

Now you can use

cat beginquery

to make sure the file looks like it should. If you made a mistake, you can use sed to fix it, or delete the file with rm and start again. Using the same method, create a file called endquery which contains the following, and check your work.

&fl[]=identifier&output=csv
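
As an aside, if you find interactive use of cat awkward, echo with output redirection will create the same two files non-interactively. The single quotes keep the shell from trying to interpret characters like ? and &.

echo 'http://archive.org/advancedsearch.php?q=' > beginquery
echo '&fl[]=identifier&output=csv' > endquery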

We’ve created the beginning and end of a query string we are going to send to the Internet Archive using wget. We still need to URL encode the query itself, and then insert it into the string.

For URL encoding, we are going to use a slick method developed by Ruslan Spivak. We use the alias command to create a URL encoder with one line of Python code. (I won’t explain how this works here, but if you would like to learn more about Python programming for humanists, there are introductory lessons at the Programming Historian website.)

At the command line you can enter the following, then type alias to check your work.

alias urlencode='python -c "import sys, urllib as ul; print ul.quote_plus(sys.argv[1])"'
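
Note that this one-liner assumes a Python 2 interpreter. If the python command on your machine is Python 3, an equivalent alias would use urllib.parse instead, something like

alias urlencode='python3 -c "import sys, urllib.parse as ul; print(ul.quote_plus(sys.argv[1]))"'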

If you made a mistake, you can remove the alias with

unalias urlencode

and try again. If your urlencode alias is OK, you can now use it to create a query string for the IA.

urlencode 'collection:gutenberg AND subject:"Natural history -- Juvenile literature"' > querystring

We then use cat to put the three pieces of our query together, and check our work.

cat beginquery querystring endquery | tr -d '\n' | sed '$a\' > iaquery

Note that we had to remove the newlines from the individual pieces of our query, then add one newline at the end. This is what the tr and sed commands do in the pipeline above. Use cat to look at iaquery and make sure it looks OK. Mine looks like

http://archive.org/advancedsearch.php?q=collection%3Agutenberg+AND+subject%3A%22Natural+history+--+Juvenile+literature%22&fl[]=identifier&output=csv

If you want to download sources for a number of different Internet Archive searches, you just need to create new querystring and iaquery files for each.

Downloading a Batch of Files with wget

When we use wget to send the query that we just constructed to the Internet Archive, their webserver will respond with a list of item identifiers. The -i option to wget tells it to use the query file that we just constructed, and the -O option tells it to put the output in a file called iafilelist. We then clean up that file list by deleting the first line and removing quotation marks, as follows

wget -i iaquery -O iafilelist
cat iafilelist | tr -d '[:punct:]' | sed '1d' > iafilelist-clean
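
If you are curious about why this cleanup is needed, look at the raw list before cleaning it:

head iafilelist

The first line is a CSV header and each identifier comes back wrapped in quotation marks; the sed and tr commands above strip those out.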

If you use cat to look at iafilelist-clean, it should contain the following identifiers

rollosexperiment24993gut
achildsprimerofn26331gut
rollosmuseum25548gut
woodlandtales23667gut
countrywalksofan23941gut
theorbispictus28299gut

(If new files in this subject category are added to the Internet Archive, or if the metadata for any of these files are changed, your list of filenames may be different. At this point, you can edit iafilelist-clean to match the list above if you wish. Then your results should match mine exactly.)

We now have a list of files that we want to download from the Internet Archive. We will run wget again to get all of these files. Read the original post on the Internet Archive blog to learn more about the wget options being used. We create a directory to hold our downloads, then download the six books. This only takes a few seconds.

mkdir download
cd download
wget -r -H -nc -nd -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ../iafilelist-clean -B 'http://archive.org/download/'
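
For quick reference, here is roughly what those wget options do; the Internet Archive post goes into more detail.

-r, -l1                  recurse, but only one level deep
-H, -np                  follow links to other hosts, but never ascend to a parent directory
-nc                      don’t re-download files that are already present
-nd, -nH, --cut-dirs=2   don’t recreate the server’s directory structure locally
-A .txt                  accept only files ending in .txt
-e robots=off            ignore robots.txt directives (be polite about this)
-i, -B                   read the identifiers from our list and resolve each one against the base URL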

I’ve chosen to work with a small batch of texts here for convenience, but essentially the same techniques work with a huge batch, too. If you do download a lot of files at once, you will probably want to remove the -nd option, so wget puts each source in a directory of its own. It is also very important not to hose other people’s web servers. You can learn more about polite uses of wget at the Programming Historian website.

Cleaning Up

When we downloaded the books, we ended up with a lot of metadata files and alternate copies. We will save the ones that we want and delete the rest. Look through the downloaded files with

ls -l | less

then clean up the metadata files and the other versions of the text

mkdir metadata
mv ?????.txt_meta.txt metadata/
mv ?????-0.txt_meta.txt metadata/
ls metadata

There should be six files in the download/metadata directory. Now you can get rid of the rest of the stuff in the download directory that we won’t be using.

rm *meta.txt
rm pg*txt
rm *-8.txt

We are left with text versions of our six books in the download directory and the metadata files for those versions in the download/metadata directory. Use less to explore the texts and their associated metadata files.

Bursting Texts into Smaller Pieces

Note that we can use grep on multiple files. In the download directory, try executing the following command.

egrep "(bees|cows)" * | less

This will give us a list of all of the lines in our six books where one or both of the search terms appear. It is not very useful, however, because we don’t have much sense of the larger context in which the term is situated, and we don’t know how relevant each instance is. Is it simply a passing mention of a bee or cow, or a passage that is about both? To answer that kind of query we will want to build a simple search engine for our collection of sources.

The first step is to burst each book into small pieces, each of which will fit on one screen of our terminal. We are doing this because it won’t do us much good to find out only that, say, the book Woodland Tales mentions bees and cows somewhere. We want to see the exact places where both are mentioned on a single page.

In Linux we can use the split command to burst a large file into a number of smaller ones. For example, the command below shows how we would split a file called filename.txt into pieces that are named filename.x.0000, filename.x.0001, and so on. The -d option tells split we want each file we create to be numbered, the -a 4 option tells it to use four digits, and the -l 20 option tells it to create files of (at most) twenty lines each. Don’t execute this command yet, however.

split -d -a 4 -l 20 filename.txt filename.x.

Instead we start by creating a directory to store the burst copies of our books. We then copy the originals into that directory.

cd ..
mkdir burstdocs
cd burstdocs
cp ../download/*.txt .
ls

The shell should respond with

23667.txt 23941.txt 24993.txt 25548.txt 26331.txt 28299-0.txt

We need to burst all of our books into pieces, not just one of them. We could type six split commands, but that would be laborious. Instead we will take advantage of the bash shell’s ability to automate repetitive tasks by using a for loop. Here is the whole command

for fileName in $(ls -1 *.txt) ; do split -d -a 4 -l 20 $fileName $fileName.x ; done

This command makes use of command substitution. It starts by executing the command

ls -1 *.txt

which creates a list of file names, one per line. The for loop then steps through this list and puts each file name into the fileName variable, one item at a time. Each time it is executed, the split command looks in fileName to figure out what file it is supposed to be processing. All of the files that split outputs are placed in the current directory, i.e., burstdocs.
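
Incidentally, the same loop can be written without command substitution by letting the shell expand the wildcard itself; either version does the same thing here.

for fileName in *.txt ; do split -d -a 4 -l 20 "$fileName" "$fileName".x ; done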

We don’t want to keep our original files in this directory after bursting them, so we delete them now.

rm ?????.txt ?????-0.txt

We can use ls to see that bursting our documents has resulted in a lot of small files. We want to rename them to get rid of the ‘txt.’ that occurs in the middle of each filename, and then we want to add a .txt extension to each.

rename 's/txt.//' *
rename 's/$/.txt/' *
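
These commands assume the Perl-based rename that Debian and Ubuntu ship by default. If your system has the util-linux rename instead, which uses a different syntax, a small bash loop run inside burstdocs is one way to get the same result. This is just a sketch:

for f in *.txt.x*; do mv "$f" "${f/.txt.x/.x}.txt"; done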

We can use ls to see that each filename now looks something like 28299-0.x0645.txt. We can also count the number of files in the burstdocs directory with

ls -1 | wc -l

There should be 1698 of them. Use cd to return to the directory which contains burstdocs and download.

Swish-e, the Simple Web Indexing System for Humans – Extended

To build a simple search engine, we are going to use the Swish-e package. This is not installed by default on Debian Linux, so you may need to install it yourself. Check to see with

man swish-e

If it is not installed, the shell will respond with “No manual entry for swish-e”. In this case, you can install the package with

sudo aptitude install swish-e
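
If aptitude is not installed on your machine, apt-get should do the same job, assuming the swish-e package is available in your distribution’s repositories.

sudo apt-get install swish-e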

Try the man command again. You should now have a man page for swish-e.

The next step is to create a configuration file called swish.conf in your current directory. It should contain the following lines

# the directory containing the files to index
IndexDir burstdocs/
# only look at files with this extension
IndexOnly .txt
# treat those files as plain text
IndexContents TXT* .txt
# the name of the index file to create
IndexFile ./burstdocs.index

Now we make the index with

swish-e -c swish.conf

Searching with Swish-e

If we want to search for a particular word, say ‘bees’, we can use the following command. The -f option tells swish-e which index we want to search in. The -m option says to return the ten most relevant results, and the -w option is our search keyword.

swish-e -f burstdocs.index -m 10 -w bees

The output consists of a list of files, sorted in decreasing order of relevance. We can use less to look at the first few hits and confirm that they do, in fact, have something to do with bees.

less burstdocs/24993.x0113.txt burstdocs/24993.x0118.txt burstdocs/23667.x0077.txt

When using less to look at a number of files like this, we can move back and forth with :n for next file, and :p for previous file. As always, q quits.

There are probably more than ten relevant hits in our set of books. We can see the next ten results with the option -b 11 which tells swish-e to begin with the eleventh hit.

swish-e -f burstdocs.index -m 10 -w bees -b 11

The real advantage of using a search engine comes in finding documents that are relevant to more complex queries. For example, we could search for passages that are about both bees and cows with

swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w "bees AND cows"

The -H 0 option tells swish-e not to print a header before our results, and the -x option says that we only want to see matching filenames, one per line. We see that there are four such passages.

It is a bit of a hassle to keep typing our long search command over and over, so we create an alias for that.

alias searchburst="swish-e -f burstdocs.index -m 10 -H 0 -x '%p\n' -w"

Now we can perform searches like

searchburst "bees AND cows"
searchburst "(bees AND cows) AND NOT clover"
searchburst "bees NEAR5 cows"

The last example returns results where ‘bees’ is within five words of ‘cows’.

A listing of relevant filenames is not very handy if we have to type each into the less command to check it out. Instead we can use the powerful xargs command to pipe our list of filenames into less.

searchburst "bees NEAR5 cows" | xargs less