Arachnid Web Spider Framework

Description

Arachnid is a Java-based Web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by subclassing Arachnid and adding a few lines of code that are called after each page of a Web site is parsed. Two example spider applications are included to illustrate how to use the framework.

Warning

WARNING: A Web spider can put a heavy load on a server and a network. You may want this by design, for instance when load testing YOUR server, using YOUR hosts and YOUR network. DO NOT use this software to place an excessive load on someone else's host and network resources without explicit permission!

Author

This software was written by Robert Platt.

Use

Example

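The following code uses Arachnid to generate a (very simplistic) site map for a Web site. The site map is written to standard output as an HTML list, with one entry per page that has a title.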


import java.net.MalformedURLException;
import java.net.URL;
import bplatt.spider.*;

/** Prints a simple HTML site map for the site given on the command line. */
public class SimpleSiteMapGen {
  private final String site;
  private final static String header = "<html><head><title>Site Map</title></head><body><ul>";
  private final static String trailer = "</ul></body></html>";

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("Usage: java SimpleSiteMapGen <url>");
      System.exit(1);
    }
    SimpleSiteMapGen s = new SimpleSiteMapGen(args[0]);
    s.generate();
  }

  public SimpleSiteMapGen(String site) { this.site = site; }

  /** Traverses the site and writes the site map to standard output. */
  public void generate() {
    MySpider spider;
    try {
      spider = new MySpider(site);
    } catch (MalformedURLException e) {
      System.err.println(e);
      System.err.println("Invalid URL: " + site);
      return;
    }
    System.out.println(header);
    spider.traverse();  // invokes the handle* callbacks below as pages are visited
    System.out.println(trailer);
  }
}

/** Spider that prints one list item for every parsed page that has a title. */
class MySpider extends Arachnid {
  public MySpider(String base) throws MalformedURLException { super(base); }

  // Called after a page is parsed: emit a list entry linking to it.
  protected void handleLink(PageInfo p) {
    String link = p.getUrl().toString();
    String title = p.getTitle();
    if (title == null || link.length() == 0 || title.length() == 0) return;
    System.out.println("<li><a href=\"" + link + "\">" + title + "</a></li>");
  }

  // The remaining callbacks are not needed for a site map, so they do nothing.
  protected void handleBadLink(URL url, URL parent, PageInfo p) { }
  protected void handleBadIO(URL url, URL parent) { }
  protected void handleNonHTMLlink(URL url, URL parent, PageInfo p) { }
  protected void handleExternalLink(URL url, URL parent) { }
}
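
The example above writes each page's link and title into the HTML as they are, so a title containing characters such as & or < could produce a malformed site map. One possible way to handle this is sketched below. HtmlEscape, escape and listItem are hypothetical names used only for illustration; they are not part of Arachnid.

```java
// Hypothetical helper, NOT part of the Arachnid framework.
// Escapes the characters that are special in HTML text and attribute values.
class HtmlEscape {
  static String escape(String s) {
    StringBuilder out = new StringBuilder(s.length() + 16);
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '&':  out.append("&amp;");  break;
        case '<':  out.append("&lt;");   break;
        case '>':  out.append("&gt;");   break;
        case '"':  out.append("&quot;"); break;
        case '\'': out.append("&#39;");  break;
        default:   out.append(c);
      }
    }
    return out.toString();
  }

  // Builds one site-map list item from an already-checked link and title.
  static String listItem(String link, String title) {
    return "<li><a href=\"" + escape(link) + "\">" + escape(title) + "</a></li>";
  }

  public static void main(String[] args) {
    // Sample values chosen for illustration only.
    System.out.println(listItem("http://example.com/?a=1&b=2", "Q&A <beta>"));
  }
}
```

With such a helper, the println call in handleLink could become System.out.println(HtmlEscape.listItem(link, title)); the rest of the example would stay the same.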

Availability

The Arachnid Web Spider framework is available via SourceForge. Follow this link to obtain the source code. If you don't already have a Java Virtual Machine, you can obtain one from Sun Microsystems.

License

The Arachnid Web Spider framework is licensed under the GNU General Public License (GPL). See GPL.txt for details. If you are unable or unwilling to abide by the terms of this license, please remove this code from your machine.

Support

The Arachnid Web Spider framework is distributed AS IS, with NO SUPPORT.
