Welcome to Apache Droids
09 August 2012 - Second Release
The second release of Droids Incubating has been made, version 0.2-incubating. Links to download are below. The binary artifacts are available on Maven Central for ease of use.
03 November 2011 - First Release
The first release of Droids Incubating has been made, version 0.1.0-incubating. Links to download are below. The binary artifacts are available on Maven Central for ease of use.
What is this?
Droids aims to be an intelligent standalone robot framework that allows you to create new droids (robots) and extend existing ones. In the future it will offer an administration application to manage and control the different droids.
Droids makes it very easy to extend existing robots or write a new one from scratch, which can automatically seek out relevant online information based on the user's specifications.
Droids (plural) is not designed for one special use case. It is a framework: take what you need, do what you want.
Droids offers you the following components so far:
- Queue, a queue is the data structure where the different tasks are waiting for service.
- Protocol, the protocol interface is a wrapper to hide the underlying implementation of the communication at protocol level.
- Parser -> Apache Tika, the parser component is just a wrapper for Tika since it offers everything we need. No need to duplicate the effort. The Parser component parses different input types to SAX events.
- Handler, a handler is a component that uses the original stream and/or the parse (ContentHandler coming from Tika) and the url to invoke arbitrary business logic on the objects. Unlike the other components, different handlers can be applied to the stream/parse.
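How the four components might fit together can be sketched in a few lines of Java. This is a minimal sketch, not the Droids API: the interface names, method signatures, and the `ComponentSketch` class are all hypothetical, and the parser returns a plain list of tokens instead of emitting SAX events as the real Tika-backed component does. It is written for a current JDK.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical interfaces for illustration only; not the actual Droids types.
interface Protocol {
    // Hides how the content behind a URL is actually fetched.
    String load(String url);
}

interface Parser {
    // Simplification: returns tokens instead of emitting SAX events.
    List<String> parse(String content);
}

interface Handler {
    // Receives the url, the original content, and the parse result.
    void handle(String url, String content, List<String> parsed);
}

class ComponentSketch {
    // Works through the queued tasks and returns the URLs that were processed.
    static List<String> run(List<String> seeds, Protocol protocol,
                            Parser parser, List<Handler> handlers) {
        Queue<String> queue = new ArrayDeque<>(seeds); // tasks waiting for service
        List<String> processed = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String content = protocol.load(url);
            List<String> parsed = parser.parse(content);
            // Several handlers may be applied to the same stream/parse.
            for (Handler handler : handlers) {
                handler.handle(url, content, parsed);
            }
            processed.add(url);
        }
        return processed;
    }
}
```

In this sketch there is one queue, one protocol, and one parser, while any number of handlers can be registered, which mirrors the distinction drawn in the Handler item above.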
A Droid (singular), however, is all about ONE special use case. For example, the helloCrawler is a wget-style crawler: you go to a page, extract the links, and afterward save the page to the file system. The focus of the helloCrawler is this special use case, and to solve it the helloCrawler uses different components.
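The wget-style flow (visit a page, extract its links, save the page) can be illustrated with a small self-contained sketch. It is not the helloCrawler source: the class `WgetStyleSketch`, the regex-based link extraction, and the in-memory maps standing in for the network and the file system are all assumptions made to keep the example runnable without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a wget-style crawl; not the helloCrawler code.
class WgetStyleSketch {
    // Naive link extraction; a real crawler would use a proper HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // 'site' stands in for the network; the returned map stands in for the file system.
    static Map<String, String> crawl(String start, Map<String, String> site) {
        Map<String, String> saved = new LinkedHashMap<>();
        Set<String> seen = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = site.get(url);
            if (html == null) {
                continue; // page not available, nothing to extract or save
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // queue each link only once
                    queue.add(link);
                }
            }
            saved.put(url, html); // save the page after its links were extracted
        }
        return saved;
    }
}
```

The `seen` set keeps the crawl from looping when pages link back to each other.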
In the future, different subprojects could evolve that provide specialist components for a special use case. However, if components get used in different use cases they should be considered common.
Binary releases are available from Maven Central.
When downloading from a mirror, please check the SHA1 checksums and verify the OpenPGP-compatible signature available from the main Apache site. The KEYS file contains the public keys used for signing releases. It is recommended that a web of trust is used to confirm the identity of these keys.
You can check the OpenPGP signature with:
gpg --verify apache-droids-*.tar.gz.asc
You can check the SHA-512 checksum with:
You can check the MD5 checksum with:
- Customizable. Completely controlled by its default.properties, which can easily be overridden by creating a file build.properties and overriding the default properties that are needed.
- Multi-threaded. The architecture is that a robot (e.g. HelloCrawler) controls various workers (threads) that are doing the actual work.
- Honor robots.txt. By default Droids honors the robots.txt. However, you can turn on the hostile mode of a droid (droids.protocol.http.force=true).
You can configure the number of concurrent threads that a
droid can distribute to its workers (droids.maxThreads=5)
and the delay time between the requests
(droids.delay.request=500). You can use one of the different
- Spring based - dynamics. The properties mentioned above get picked up by the build process, which injects them into the Spring configuration.
- Extensible - dynamics. The Spring configuration makes use of the cocoon-configurator and its dynamic registry support (making extending Droids a pleasure).
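The idea of defaults that are overridden only where needed can be sketched with plain `java.util.Properties`. This is an illustration of the layering concept under stated assumptions, not how Droids wires its configuration: in Droids the build process injects the values into the Spring configuration, whereas the hypothetical `DroidsConfigSketch` class below merges two property texts at runtime. The property keys are the ones named in the list above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; Droids itself resolves these at build time.
class DroidsConfigSketch {
    // Returns properties where 'overridesText' wins and 'defaultsText' fills the gaps.
    static Properties layered(String defaultsText, String overridesText) {
        try {
            Properties defaults = new Properties();
            defaults.load(new StringReader(defaultsText));
            // Lookups that miss in 'merged' fall back to 'defaults'.
            Properties merged = new Properties(defaults);
            merged.load(new StringReader(overridesText));
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int intValue(Properties properties, String key) {
        return Integer.parseInt(properties.getProperty(key).trim());
    }
}
```

For example, with defaults of `droids.maxThreads=5` and `droids.delay.request=500` and an override text containing only `droids.maxThreads=2`, the merged view uses 2 threads and keeps the default delay of 500.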
The following graph shows the basic architecture of Droids with the help of the first implementation (helloCrawler).
Why was it created?
Mainly because of personal curiosity and a use case: the background of this work is that Cocoon trunk does not provide a crawler anymore and Forrest is based on it, meaning we cannot update until we find a crawler replacement. Getting more involved in Solr and Nutch, we also saw requests for a generic standalone crawler.
- JDK 1.6 or higher
Links / related projects
- Nutch web-search software
- The Web Robots Pages
- Programming webcrawler
- Writing a Web Crawler in the Java Programming Language
- Crawling AJAX
- Crowbar is a web scraping environment based on the use of a server-side headless mozilla-based browser.
- OSCube is a framework of sorts. I find myself writing a particular type of application, some kind of job engine, repeatedly and OSCube is the generic version thereof. It uses Simple-JNDI, and therefore really just basic JNDI, as its configuration system, and the Quartz scheduler for Cron-work.