Arachnid Web Spider Framework
Description
Arachnid is a Java-based web spider framework. It includes a
simple HTML parser object that parses an input stream containing
HTML content. Simple Web spiders can be created by sub-classing
Arachnid and adding a few lines of code called after each page
of a Web site is parsed. Two example spider applications are
included to illustrate how to use the framework.
Warning
WARNING:
A Web spider may put a large load on a server and a
network. You may wish to do this by design - for instance when
load testing YOUR server, using YOUR hosts and YOUR network.
DO NOT use this software to place an excessive load on someone
else's host and network resources without explicit permission!
Author
This software was written by Robert Platt.
Use
- Build an Arachnid.jar file using build.xml and Ant. You can
also build documentation using the 'docs' target.
- Add the jar file to your CLASSPATH
- Arachnid is an abstract base class that uses the
"visitor" pattern. Its traverse() method walks through
a Web site and calls the abstract method handleLink()
for each valid page in the site.
You need to derive a sub-class from Arachnid and define
a handleLink() method. A PageInfo object containing useful
information about the Web page is passed to handleLink().
Four other methods must also be defined:
- handleBadLink() - for processing an invalid URL
- handleNonHTMLlink() - for processing links to non-HTML resources
- handleExternalLink() - for processing links that are outside the Web site
- handleBadIO() - in the event of an I/O problem while attempting to process a Web page
Instantiate your sub-class and call traverse().
- Compile your application and run it.
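The callback flow described above can be tried without a network or the Arachnid jar. The following is only a sketch: MiniSpider, CollectingSpider and the in-memory "site" map are hypothetical stand-ins invented for illustration, and they use plain strings where Arachnid uses URL and PageInfo objects. It shows the base class owning the traversal while the sub-class supplies the handlers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Arachnid (NOT the real class): walks an
// in-memory "site" instead of fetching pages over the network.
abstract class MiniSpider {
    private final Map<String, List<String>> site; // page -> links found on it
    private final String start;

    MiniSpider(Map<String, List<String>> site, String start) {
        this.site = site;
        this.start = start;
    }

    // Called once for every reachable page that exists in the site.
    protected abstract void handleLink(String page);

    // Called for every link that points at a page that does not exist.
    protected abstract void handleBadLink(String link, String parent);

    // Depth-first walk; the visited set keeps cycles from looping forever.
    public void traverse() {
        visit(start, null, new HashSet<String>());
    }

    private void visit(String page, String parent, Set<String> visited) {
        if (!site.containsKey(page)) {
            handleBadLink(page, parent);
            return;
        }
        if (!visited.add(page)) {
            return; // already handled
        }
        handleLink(page);
        for (String link : site.get(page)) {
            visit(link, page, visited);
        }
    }
}

// The sub-class only supplies the callbacks.
class CollectingSpider extends MiniSpider {
    final List<String> pages = new ArrayList<String>();
    final List<String> badLinks = new ArrayList<String>();

    CollectingSpider(Map<String, List<String>> site, String start) {
        super(site, start);
    }

    @Override
    protected void handleLink(String page) {
        pages.add(page);
    }

    @Override
    protected void handleBadLink(String link, String parent) {
        badLinks.add(parent + " -> " + link);
    }

    // Builds a small fixed site, traverses it, and returns the spider.
    static CollectingSpider demo() {
        Map<String, List<String>> site = new LinkedHashMap<String, List<String>>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/", "/missing"));
        site.put("/b", Arrays.asList("/a"));
        CollectingSpider spider = new CollectingSpider(site, "/");
        spider.traverse();
        return spider;
    }
}
```

Calling CollectingSpider.demo() visits each existing page once, even though the pages link back to each other, and records the one dangling link together with the page it was found on.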
Example
The following code uses Arachnid to generate a (very simplistic) site
map for a Web site.
import java.net.MalformedURLException;
import java.net.URL;
import bplatt.spider.*;

public class SimpleSiteMapGen {
    private String site;
    private final static String header = "<html><head><title>Site Map</title></head><body><ul>";
    private final static String trailer = "</ul></body></html>";

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("java SimpleSiteMapGen <url>");
            System.exit(-1);
        }
        SimpleSiteMapGen s = new SimpleSiteMapGen(args[0]);
        s.generate();
    }

    public SimpleSiteMapGen(String site) { this.site = site; }

    // Prints the HTML header, one list item per page, then the trailer.
    public void generate() {
        MySpider spider = null;
        try { spider = new MySpider(site); }
        catch (MalformedURLException e) {
            System.err.println(e);
            System.err.println("Invalid URL: " + site);
            return;
        }
        System.out.println(header);
        spider.traverse();
        System.out.println(trailer);
    }
}

class MySpider extends Arachnid {
    public MySpider(String base) throws MalformedURLException { super(base); }

    // Called for each valid page: emit a list item linking to it.
    protected void handleLink(PageInfo p) {
        String link = p.getUrl().toString();
        String title = p.getTitle();
        // Skip pages that have no usable link text.
        if (link == null || title == null || link.length() == 0 || title.length() == 0) return;
        System.out.println("<li><a href=\"" + link + "\">" + title + "</a></li>");
    }

    // The remaining handlers are required but unused in this example.
    protected void handleBadLink(URL url, URL parent, PageInfo p) { }
    protected void handleBadIO(URL url, URL parent) { }
    protected void handleNonHTMLlink(URL url, URL parent, PageInfo p) { }
    protected void handleExternalLink(URL url, URL parent) { }
}
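Note that the example prints each page title and URL into the HTML output unchanged, so a title containing "<" or "&" would produce a malformed site map. One possible fix is sketched below. HtmlEscape and its methods are hypothetical helpers written for this illustration; they are not part of Arachnid.

```java
// Hypothetical helper (not part of Arachnid): escapes the characters that
// are significant in HTML text and in double-quoted attribute values.
final class HtmlEscape {
    private HtmlEscape() { }

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Builds one site map entry with both the link and the title escaped.
    static String listItem(String link, String title) {
        return "<li><a href=\"" + escape(link) + "\">" + escape(title) + "</a></li>";
    }
}
```

With such a helper, handleLink() could print HtmlEscape.listItem(link, title) instead of concatenating the raw strings.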
Availability
The Arachnid Web Spider framework is available via SourceForge.
Follow this link
to obtain the source code. If you don't already
have a Java Virtual Machine, you can obtain one from
Sun Microsystems.
License
The Arachnid Web Spider framework is licensed under the GNU General Public License. See GPL.txt for
details. If you are unable or unwilling to abide by the terms of
this license, please remove this code from your machine.
Support
The Arachnid Web Spider framework is distributed AS IS, with NO SUPPORT.