Once you start collecting large numbers of digital sources by searching for them or using an information trapping strategy, you will find that you are often in the position of wanting to download a lot of files from a given site. Obviously you can click on the links one at a time to make local copies, but that loses one of the main advantages of working with computers: letting them do your work for you. Instead, you should use a program like DownThemAll (a Firefox extension), SiteSucker (a standalone program), or GNU wget (a command-line tool that you call from the terminal). Each of these automates what would otherwise be a very repetitive and thankless task. If you have never tried this before, start by getting comfortable with one of the more user-friendly alternatives like DownThemAll or SiteSucker, then move to wget when you need more power. You can also write your own custom programs to harvest sources, of course. There is an introductory lesson on this in the Programming Historian.
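If you do decide to roll your own, a minimal harvester might look something like the sketch below. It assumes the Python requests and BeautifulSoup libraries and a made-up page of PDF links; it is only meant to illustrate the kind of repetitive work that tools like wget automate for you.

```python
# A minimal harvester: download every PDF linked from a single page.
# Assumes the third-party requests and beautifulsoup4 packages
# (pip install requests beautifulsoup4). The URL is a placeholder.

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "http://example.com/sources/"  # hypothetical page of links

page = requests.get(START_URL, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

for link in soup.find_all("a", href=True):
    url = urljoin(START_URL, link["href"])  # resolve relative links
    if url.lower().endswith(".pdf"):
        filename = os.path.basename(urlparse(url).path)
        print("Downloading", filename)
        response = requests.get(url, timeout=60)
        with open(filename, "wb") as f:
            f.write(response.content)
```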
Now imagine a harvester that is not only capable of downloading files, but also of extracting the hyperlinks found on each page, navigating to the new pages, and downloading those as well. This kind of program is called a spider, crawler or bot. Search engine companies make extensive use of web spiders to create a constantly updated (and constantly out-of-date) partial map of the entire internet. For research, it is really nice to be able to spider more limited regions of the web in search of quality sources. Again, it is not too difficult to write your own spiders (see Kevin Hemenway and Tara Calishain’s Spidering Hacks), but there are also off-the-shelf tools which will do some of the spidering for you. In addition to writing my own spiders, I’ve used a number of these packages. Here I will describe DevonAgent.
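Before getting to DevonAgent, here is a rough sketch of what a do-it-yourself spider might look like, building on the harvester above. It again assumes requests and BeautifulSoup, a placeholder starting URL, and a crawl() function of my own devising; it is a sketch, not a finished tool.

```python
# A minimal spider: start from one page, follow hyperlinks to a fixed
# depth, and record every page it visits. Same assumptions as the
# harvester above (requests, beautifulsoup4); the URL is a placeholder.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(url, depth, visited=None):
    """Visit url, then recursively visit the pages it links to."""
    if visited is None:
        visited = set()
    if depth < 0 or url in visited:
        return visited
    visited.add(url)
    try:
        page = requests.get(url, timeout=30)
    except requests.RequestException:
        return visited  # skip pages that fail to load
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a", href=True):
        crawl(urljoin(url, link["href"]), depth - 1, visited)
    return visited

# Collect this page and everything it links to (one hop).
for page_url in sorted(crawl("http://example.com/blog/post.html", depth=1)):
    print(page_url)
```

A real spider would also restrict itself to particular domains, respect robots.txt, and pause between requests; the off-the-shelf tools handle those details for you.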
DevonAgent includes a web browser with some harvesting capabilities built in. You can describe what you are looking for with a rich set of operators like “(Industrial BEFORE Revolution) AND (India NEAR/10 Textile*)”. Results are displayed with both a relevance ranking and an interactive graphical topic map that you can use to navigate. You can keep your results in an internal archive or export them easily to DevonThink. (More on this in a future post.) You can schedule searches to run automatically, thus extending your information trapping strategy. DevonAgent also has what are called “scanners”: filters that recognize particular kinds of information. You can search, for example, for pages that contain PDFs, e-mail addresses, audio files, spreadsheets, or webcams. You can also specify which URLs your spider will visit (including password-protected sites). DevonAgent comes with about 80 plugins for search engines and databases, including sites like Google Scholar, IngentaConnect, the Internet Archive and Project Gutenberg. You can also write your own plugins in XML.
DevonAgent allows you to follow links automatically and to set the depth of the search. If you were looking at this blog post with DevonAgent, a Level 1 search would also retrieve the pages linked to by this post (for DownThemAll, etc.), some other pages from my blog, and so on. A Level 2 search would retrieve everything that a Level 1 search gets, plus the pages that the DownThemAll page links to, the pages linked to in some of my other blog posts, and so on. Since a spider is a program that you run while you are doing something else, it is OK if it goes down a lot of blind alleys in order to find things that are relatively rare. Learning where to start and how to tune the depth of your search is an essential part of using spidering for your research. DevonAgent will use Growl to notify you when its search is complete. (If there is something that I am eagerly awaiting, I also use Prowl to get Growl notifications when I’m away from my computer. But you may find that’s too much of a good thing.)
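In terms of the homemade spider sketched above, the search level corresponds roughly to the depth argument: each additional level follows one more hop of links. (The mapping to DevonAgent’s own settings is my approximation, and the URL is still a placeholder.)

```python
# Using the crawl() sketch from earlier: a deeper search is just a
# larger depth, at the cost of visiting many more pages.
level_1 = crawl("http://example.com/blog/post.html", depth=1)
level_2 = crawl("http://example.com/blog/post.html", depth=2)
print(len(level_1), "pages at level 1;", len(level_2), "pages at level 2")
```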