Apache Incubator

Welcome to Apache Droids

News

09 August 2012 - Second Release

The second release of Droids Incubating has been made, version 0.2-incubating. Links to download are below. The binary artifacts are available on Maven Central for ease of use.

03 November 2011 - First Release

The first release of Droids Incubating has been made, version 0.1.0-incubating. Links to download are below. The binary artifacts are available on Maven Central for ease of use.

What is this?

Droids aims to be an intelligent standalone robot framework that lets you create new droids (robots) and extend existing ones. In the future it will offer an administration application to manage and control the different droids.

Droids makes it very easy to extend existing robots or to write a new one from scratch that automatically seeks out relevant online information based on the user's specifications.

Droids (plural) is not designed for one particular use case; it is a framework: take what you need, do what you want.

Droids offers the following components so far:

  • Queue, a queue is the data structure in which the different tasks wait to be serviced.
  • Protocol, the protocol interface is a wrapper that hides the underlying implementation of the communication at protocol level.
  • Parser -> Apache Tika, the parser component is just a wrapper for Tika, since Tika offers everything we need. No need to duplicate the effort. The Parser component parses different input types to SAX events.
  • Handler, a handler is a component that uses the original stream and/or the parse (the ContentHandler coming from Tika) and the URL to invoke arbitrary business logic on the objects. Unlike the other components, several different handlers can be applied to the same stream/parse.
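
How these components fit together can be illustrated with a minimal sketch. Every name in it (ProtocolSketch, HandlerSketch, ComponentSketch) is hypothetical and chosen for illustration; none of them is the Droids API, the parser step is left out, and the "protocol" is an in-memory map rather than a real HTTP client. The code assumes Java 8 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// All names below are hypothetical; they are not the Droids interfaces.
interface ProtocolSketch {
    // Returns the content behind a URI, hiding how it was fetched.
    String load(String uri);
}

interface HandlerSketch {
    // Runs arbitrary logic on the URI and the content loaded for it.
    void handle(String uri, String content);
}

final class ComponentSketch {
    // Takes tasks off the queue one by one, loads each through the protocol,
    // and hands the result to every registered handler in turn.
    static int run(Queue<String> queue, ProtocolSketch protocol, List<HandlerSketch> handlers) {
        int processed = 0;
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            String content = protocol.load(uri);
            for (HandlerSketch handler : handlers) {
                handler.handle(uri, content);
            }
            processed++;
        }
        return processed;
    }

    // Wires an in-memory "protocol" and two handlers together
    // and returns what the handlers recorded.
    static List<String> demo() {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://localhost/a", "page a");
        pages.put("http://localhost/b", "page b");

        List<String> log = new ArrayList<>();
        HandlerSketch recordUri = (uri, content) -> log.add("seen " + uri);
        HandlerSketch recordSize =
                (uri, content) -> log.add(uri + " has " + content.length() + " chars");

        Queue<String> queue = new ArrayDeque<>(pages.keySet());
        run(queue, pages::get, Arrays.asList(recordUri, recordSize));
        return log;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is the last bullet: the queue and the protocol each appear once, while any number of handlers can be registered for the same content.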

A Droid (singular), however, is all about ONE particular use case. For example, the helloCrawler is a wget-style crawler: it goes to a page, extracts the links, and then saves the page to the file system. The helloCrawler focuses on this one use case and uses different components to solve it.
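
The two steps of that wget-style use case, extracting links and saving the page, can be sketched as follows. This is not the helloCrawler source: HelloCrawlerSketch and its methods are hypothetical, and the regular expression is a deliberately naive stand-in for the Tika-based parsing that Droids uses. It only finds double-quoted href attributes.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the helloCrawler implementation.
final class HelloCrawlerSketch {
    // Naive link pattern: matches href="..." with double quotes only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href found in the page, resolved against the page's own URI.
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(base.resolve(matcher.group(1)));
        }
        return links;
    }

    // Writes the page into targetDir, named after the last path segment of its URI.
    static Path save(Path targetDir, URI page, String html) throws IOException {
        String path = page.getPath();
        String name = (path == null || path.isEmpty() || path.endsWith("/"))
                ? "index.html"
                : path.substring(path.lastIndexOf('/') + 1);
        Files.createDirectories(targetDir);
        Path file = targetDir.resolve(name);
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        return file;
    }
}
```

A crawler loop would put the extracted links back on the queue and call save once the page has been processed.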

In the future, subprojects could evolve that provide specialist components for a particular use case. However, components that are used in several use cases should be considered common.

Download

Download the source below:

Droids 0.2.0-incubating: Source code (ZIP archive), ASC Signature, SHA1 Checksum, MD5 Checksum

Binary releases are available from Maven Central.

When downloading from a mirror, please check the SHA1 checksum and verify the OpenPGP-compatible signature available from the main Apache site. The KEYS file contains the public keys used for signing releases. It is recommended that a web of trust be used to confirm the identity of these keys.

You can check the OpenPGP signature with:

gpg --verify apache-droids-*.tar.gz.asc

You can check the SHA1 checksum with:

sha1sum apache-droids-*.tar.gz

You can check the MD5 checksum with:

md5sum apache-droids-*.tar.gz
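
For readers who want to run the same check from Java, the sketch below computes a digest with the JDK's MessageDigest and compares it with a published value. ChecksumSketch and its method names are hypothetical; the comparison assumes the published file contains only the hex digest, with no file name after it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; mirrors what the checksum commands compute.
final class ChecksumSketch {
    // Streams the input through the named digest (for example "SHA-1" or "MD5")
    // and returns the result as lowercase hex.
    static String hexDigest(String algorithm, InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Compares a computed digest with the published one,
    // ignoring case and surrounding whitespace.
    static boolean matches(String computedHex, String publishedHex) {
        return computedHex.equalsIgnoreCase(publishedHex.trim());
    }
}
```

To check a real download, pass a stream opened on the archive (for example with java.nio.file.Files.newInputStream) and the contents of the matching checksum file. A matching checksum only shows the file was not corrupted; the OpenPGP signature is what ties it to the release signers.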

Install

Since Droids is currently developing very fast, we decided to maintain our documentation in the wiki so that all users can participate. Please see our installation guide for how to get started with Droids.

Feature list

  • Customizable. Completely controlled by its default.properties, which can easily be overridden by creating a build.properties file and overriding the default properties as needed.
  • Multi-threaded. A robot (e.g. HelloCrawler) controls various workers (threads) that do the actual work.
  • Honor robots.txt. By default Droids honors robots.txt. However, you can turn on the hostile mode of a droid (droids.protocol.http.force=true).
  • Crawl throttling. You can configure the number of concurrent threads that a droid can distribute to its workers (droids.maxThreads=5) and the delay between requests (droids.delay.request=500). You can use one of the different delay components:
    • SimpleDelayTimer
    • RandomDelayTimer
    • GaussianRandomDelayTimer
  • Spring based - dynamics. The properties mentioned above get picked up by the build process, which injects them into the Spring configuration.
  • Extensible - dynamics. The Spring configuration makes use of the cocoon-configurator and its dynamic registry support (making extending Droids a pleasure).
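
The idea behind overridable defaults and the three kinds of delay can be sketched in plain Java. Nothing here is the Droids implementation: DelaySketch and ThrottleSketch are hypothetical names, the defaults are hard-coded instead of being read from default.properties, and the sketch assumes that droids.delay.request is a value in milliseconds, which the feature list does not state.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical interface; Droids ships its own delay components.
interface DelaySketch {
    long nextDelayMillis();
}

final class ThrottleSketch {
    // Builds the effective settings: built-in defaults,
    // overridden by whatever the caller supplies.
    static Properties layered(Properties overrides) {
        Properties defaults = new Properties();
        defaults.setProperty("droids.maxThreads", "5");
        defaults.setProperty("droids.delay.request", "500");
        Properties effective = new Properties(defaults);
        effective.putAll(overrides);
        return effective;
    }

    // Always waits the same amount of time.
    static DelaySketch fixed(long millis) {
        return () -> millis;
    }

    // Waits a uniformly distributed time between 0 and maxMillis inclusive.
    static DelaySketch uniform(long maxMillis, Random random) {
        return () -> (long) (random.nextDouble() * (maxMillis + 1));
    }

    // Waits around meanMillis with the given standard deviation, never less than zero.
    static DelaySketch gaussian(long meanMillis, double stdDevMillis, Random random) {
        return () -> Math.max(0L, Math.round(meanMillis + random.nextGaussian() * stdDevMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        Properties settings = layered(new Properties());
        long base = Long.parseLong(settings.getProperty("droids.delay.request"));
        DelaySketch delay = gaussian(base, base / 4.0, new Random());
        for (int i = 0; i < 3; i++) {
            long wait = delay.nextDelayMillis();
            System.out.println("waiting " + wait + " ms before request " + i);
            Thread.sleep(wait);
        }
    }
}
```

A randomized delay spreads requests out less predictably than a fixed one, which is one reason a crawler might offer more than a single timer.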

Architecture

The following graph shows the basic architecture of droids with the help of the first implementation (helloCrawler).

[Figure: Overview]

Why was it created?

Mainly because of personal curiosity and a use case. The background of this work is that Cocoon trunk no longer provides a crawler and Forrest is based on it, meaning we cannot update until we have found a crawler replacement. While getting more involved in Solr and Nutch, we also saw requests for a generic standalone crawler.

Requirements

  • JDK 1.6 or higher

Heads up

Please crawl ONLY localhost, NEVER an internet site, when you test for the first time! You will need to adjust the urlfilters to limit loops.
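
What such a filter needs to do can be sketched in a few lines. LocalOnlyFilterSketch is a hypothetical class, not one of the Droids urlfilters, and how a filter is registered with a droid is not shown here. The sketch accepts a URL only if it points at the local machine and has not been accepted before, which covers both parts of the warning: staying on localhost and not looping.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter; not part of the Droids API.
final class LocalOnlyFilterSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true only for local URLs that have not been accepted before.
    boolean accept(String url) {
        URI uri;
        try {
            uri = URI.create(url).normalize();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        String host = uri.getHost();
        if (host == null) {
            return false;
        }
        String lowerHost = host.toLowerCase(Locale.ROOT);
        if (!lowerHost.equals("localhost") && !lowerHost.equals("127.0.0.1")) {
            return false;
        }
        // Rebuild the URL without its fragment so page#a and page#b count as one page.
        String key = uri.getScheme() + "://" + lowerHost + ":" + uri.getPort()
                + uri.getRawPath()
                + (uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery());
        return seen.add(key);
    }
}
```

Remembering visited URLs stops simple cycles between pages; it does not stop endlessly generated URLs such as calendars or session ids, which still need a depth or pattern limit.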

Links / related projects

Disclaimer

Apache Droids is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Lucene and HttpComponents PMCs. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.