Archives for the month of: September, 2014

This past summer, Ian Milligan, Mary Beth Start and I worked on customizing a Debian Linux virtual machine for doing historical research using digital primary and secondary sources. The machine is called HistoryCrawler, and runs on both Macs and PCs. You can build one of your own by following the steps below.

1. Install VirtualBox and create a Debian Linux VM using the instructions here.

2. Disable attempts to load software from CDROM. Log into the VM, open a terminal (e.g., Konsole, UXterm) and using the following command, comment out the cdrom line of sources.list

sudo vi /etc/apt/sources.list

3. Update and upgrade.

sudo apt-get update
sudo apt-get upgrade

4. Install Guest Additions for VirtualBox VM using the reference here.

4a. Install DKMS.

sudo apt-get install dkms
sudo apt-get install linux-headers-3.14-1-486

4b. Reboot the guest system with Leave->Restart in KDE.

4c. Insert CD image. In VirtualBox menubar of guest machine (i.e., the Debian virtual machine) choose Devices->Insert Guest Additions CD Image.

4d. (Optional) Check drive has been mounted. Open up Dolphin File Manager. On the left hand side you should see the VBOXADDITIONS drive has been mounted.

4e. Open a terminal and enter the following commands.

cd /media/cdrom
sudo sh ./

4f. Reboot the guest system with Leave->Restart in KDE.

5. Drag and Drop. Set up bidirectional drag and drop (Devices->Drag’n’Drop) and shared clipboard. Try copying a URL from the host operating system and pasting it into Konqueror web browser with Ctrl-V. Then copy something from guest system with Ctrl-C and try pasting it in host OS.

6. Shared Folder. Shut the VM down completely, then follow these instructions to set up a shared folder. The instructions are actually older than the newest version of VirtualBox; you want to make your shared folder permanent and automount it. The shared folder is


6b. Permissions. Give the hcu user permissions to interact with shared folder by adding him/her to vboxsf group.

sudo usermod -a -G vboxsf hcu

6c. Reboot the guest system with Leave->Restart in KDE and then confirm that the hcu user can now access /media/sf_shared-folder at the terminal or with Dolphin.

6d. (Optional) Set up Kuser tool to manage users and groups. Access this tool with Applications->System->User Manager.

sudo apt-get install kuser

7. Install Zotero. Start Applications->Internet->Iceweasel and go to Install Zotero for Firefox then restart Iceweasel.

8. (Optional) Outwit Docs and Images. Start Applications->Internet->Iceweasel and go to Install for Firefox then restart Iceweasel. Then go to Install for Firefox then restart Iceweasel. (N.B. not sure if these are working properly).

9. Java JDK.

sudo apt-get install default-jdk

10. Image, text and document processing tools and OCR.

sudo apt-get imagej
sudo apt-get install pandoc
sudo apt-get install tre-agrep 
sudo apt-get install pdftk
sudo apt-get install tesseract-ocr tesseract-ocr-eng 

11. Stanford Natural Language Processing Tools. Install the CoreNLP package and the Named Entity Recognition (NER) package. The latter is actually included in the former, but we install it separately to maintain backwards compatibility with tutorials I have already written.

unzip stanford*.zip
rm stanford*.zip
mv stanford* stanford-corenlp
unzip stanford-ner*.zip
rm stanford-ner*.zip
mv stanford-ner* stanford-ner

12. Install csvfix. Be careful with the rm command!

unzip c21*zip
rm c21*zip
cd neilb*
make lin
sudo cp ./csvfix/bin/csvfix /usr/local/bin
cd ~
rm -r neilb*

13. Install graphviz and swish-e.

sudo apt-get install graphviz
sudo apt-get install swish-e

14. Install Javascript Libraries: D3.

sudo apt-get install libjs-d3

15a. Install Python Libraries: NLTK.

wget -O - | sudo python
sudo easy_install pip
sudo pip install -U numpy
sudo pip install -U pyyaml nltk

15b. Install Python Libraries: SciPy stack.

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

15c. Install Python Libraries: Scikit-learn.

sudo apt-get install python-sklearn

15d. Install Python Libraries: Beautiful Soup.

sudo apt-get install python-beautifulsoup

15e. Install Python Libraries: Internet Archive.

sudo pip install internetarchive

15f. Install Python Libraries: Orange.

sudo easy_install orange

16. Install Open Refine.

tar -xvf goo*gz
rm goo*gz
cd google-refine-2.5
./refine &

17. Install R.

sudo aptitude install r-base-dev
sudo aptitude install r-base-html r-doc-pdf

18. Install Overview Project.

unzip overview*zip
rm overview*zip
cd overview*
./run &

Leave that running, start Iceweasel, browse to http://localhost:9000 and log in as with password To quit, press Ctrl+C in the terminal where run is running. Don’t worry: your data is safe no matter when you quit.

19. Install SOLR. After this step it will be in ~/solr-4.8.1 and the example server can be run.

tar zxvf solr-4.8.1.tgz 
rm solr-4.8.1.tgz 

20. Install MAHOUT. This requires subversion and maven.

sudo apt-get install subversion
sudo apt-get install maven
svn co
cd trunk
mvn install

21. Install MALLET.

tar -zxvf mallet-2.0.7.tar.gz
rm mallet-2.0.7.tar.gz

Run the GUI .jar file from the home directory with

java -jar TopicModelingTool.jar

22. Web Archiving Tools. We are using a deprecated copy of WARC tools that is better with full text.

sudo apt-get install git
git clone

Copy Historian-WARC-1/warc/hanzo-warc directory to home directory and deleted Historian-WARC-1.

23. Install SEASR.

cd Meandre-1.4.12

In web browser navigate to http://localhost:1712/ and log in with

username: admin
password: admin
host: localhost
leave port at 1712

In workbench, open up the locations tab and add the following two locations to get default components and flows. Components (the first one) may take a few minutes because it’s downloading a ton of stuff. No worries.

When done, remember to return to the ‘Meandre-1.4.12’ directory and run


24. Install Voyant.

mkdir Voyant-Server
cd Voyant-Server
java -jar VoyantServer.jar

Open web browser and navigate to To stop the server, click Stop Server in the GUI.

25. Install Git Atom.

sudo apt-get install nodejs
sudo apt-get install libgnome-keyring-dev
tar xzvf node-v0.10.28.tar.gz
cd node-v0.10.28
./configure && make
sudo make install
git clone
cd atom
sudo script/grunt install

26. Use the HistoryCrawler VM. Here are some links to help you get started

Many digital humanists are probably aware that they could make their research activities faster and more efficient by working at the command line. Many are probably also sympathetic to arguments for open source, open content and open access. Nevertheless, switching to Linux full-time is a big commitment. Virtualization software, like Oracle’s free VirtualBox, allows one to create Linux machines that run inside a window on a Mac or PC. Since these virtual machines can be created from scratch whenever you need one, they make an ideal platform for learning command line techniques. They can also be customized for particular research tasks, as I will show in later posts.

In this post I show how to create a Debian Linux virtual machine inside VirtualBox. It has a GUI desktop installed (KDE), so you can interact with it both by using commands entered in a shell and by clicking with a mouse. The screenshots come from a Mac, but the install process should be basically the same for a Windows PC.

To get started, you need to download two things. The first of these is a disk image file (ISO) for the version of Linux you want to install. These files are different depending on the processor in your computer. Here the disk image that I will use is a 32-bit testing installation with KDE, for maximum compatibility with both Macs and PCs. Check the Debian distribution page for more details. The other thing that you need to download is the Oracle VirtualBox software for your operating system. Once you have downloaded VirtualBox, install it and then start it.

The image below shows the VirtualBox Manager running on my Mac. I have already created three other Linux virtual machines, but we can ignore these.

00-virtualbox-managerTo create a new virtual machine, click the “New” button in the upper left hand corner of the Manager. Debian Linux comes in three standard flavours, known as “stable,” which is very solid but not very up-to-date, “testing,” which is pretty solid and reasonably up-to-date, and “unstable,” which is just that. I am going to name my machine ‘HistoryCrawler’ along with some information about the date and processor.  You can call yours whatever you’d like.

historycrawler-01Once you click “Continue,” the VirtualBox software will ask you a number of questions. For this installation we are going to use a memory size of 1024 megabytes of RAM (this can be increased later), a virtual hard drive formatted as a VDI (VirtualBox Disk Image), dynamically allocated disk storage, and 16 gigabytes for the virtual machine.







(I originally set the virtual hard drive size to 8 GB but we later had to increase it to 16 GB. So choose 16 GB here, despite the pictures below.)


Once we have set all of the options for the virtual machine, we are returned to the VirtualBox Manager.


Now we choose the virtual machine we just created and click the “Start” button in the Manager. The new machine starts with a message about how the mouse is handled when the cursor is over the virtual machine window.


Once you’ve read and accepted the message, the virtual machine will ask you for a start-up disk.

09-select-startup-diskClick the file icon with the green up arrow on it, and you will be given a dialog that lets you choose the Debian ISO file you downloaded earlier.


The ISO file is now selected.


When you click “start” the Debian Install process will begin in the virtual machine window.


You can move around the installer options with the Up and Down arrows and Tab key. Use the Enter key to select an item. If there are options, you can usually turn them on or off with the Space bar. Here, press Enter to choose the “Install” option.

Next you want to select your language, location and preferred keyboard layout.



15-choose-keyboardThe installer will ask you for a hostname and a domain name. You can set the former to whatever you’d like; leave the latter blank unless you have a reason to set it.



17-blank-domain-nameNext, the installer will ask you for a root password. In Linux and Unix systems, the root account typically has the power to do everything, good and bad. Rather than setting a root password, we are going to leave the root password entry blank. The installer will respond by not creating a root account, but rather by giving the user account (i.e., you) sudo privileges.


19-confirm-blank-root-passwordNow that the root account is disabled, you can enter your own name, username and password, and set the time zone.







24-set-timezoneThe next set of screens ask you to specify how you would like the file system to be set up. As before, we will use the defaults. Later, when you are more familiar with creating virtual machines for specific tasks, you can tweak these as desired. We want guided partitioning, and we are going to use the entire virtual disk (this is the 8Gb dedicated to this particular virtual machine)

25-guided-entire-disk-partitionWe only have one disk to partition, so we choose it.

26-partition-disksWe want all of our files in one partition for now.  Later, if you decide to do a lot of experimentation with Linux you may prefer to put your stuff in separate partitions when you create new virtual machines.

27-all-files-one-partitionWe can finish the partitioning…


and write the changes to disk.

29-write-changes-to-diskNow the install process will ask us if we want to use other disk image files.  We do not.

30-dont-scan-another-diskWe are going to grab install files from the Internet instead of from install disk images. (If you are working in a setting where downloads are expensive, you may not wish to do this.) We set up a network mirror to provide the install files.


Tell the installer what country you are in.

32-choose-mirror-countryThen choose a Debian archive mirror. The default mirror is a good choice.

33-choose-mirrorNow the installer will ask if we want to use a proxy server. Leave this blank unless you have a reason to change it.

34-blank-http-proxyI opt out of the popularity contest.

35-no-popularity-contestDebian gives you a lot of options for pre-installed bundles of software.  Here we are going to install the Debian Desktop Environment (the GUI, KDE), the web and print servers, a SQL database, file server, SSH server and standard system utilities. If you are working on a laptop, check the laptop box, too.


The final step is to install the Grub bootloader.



Now the virtual machine will reboot when you click “Continue”.

38-finish-installationThis is the login prompt for your new Debian virtual machine.


You can use Linux commands to shutdown the virtual machine if you would like.  You can also save it in such a way that it will resume where you left off when you reload it in VirtualBox. In the VirtualBox Manager, right click on the virtual machine and choose “Close”->”Save State”. That is shown in the next screenshot.

40-close-vmYou can save backups of your virtual machine whenever you reach a crucial point in your work, store VMs in the cloud, and share them with colleagues or students. You can also create different virtual machines for different tasks and use them to try out other Linux distributions. On my Macs, I also have Win XP and Win 7 VMs so I can run Windows-only software.