Apache UIMA
UIMA Sandbox

 UIMA Sandbox

The UIMA sandbox is a workspace that is open to all UIMA committers and developers who would like to contribute code and join the UIMA developer community. The sandbox hosts analysis components and tooling around UIMA. All the components are free to use and licensed under the Apache Software License.

Components often migrate from here to other parts of the site, over time, as part of the process of integration by the Apache community.

A list of proposed analysis components and tooling for UIMA is available at the UIMA wiki and can be discussed there.

You can access the UIMA sandbox in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/.

The list below shows the components currently available in the UIMA sandbox. Many of these components are annotators and have been released; see the download page.

 UIMA sandbox components

Annotators and Consumers

Servers

Packaging tools

Miscellaneous

These are described in more detail below.


Whitespace Tokenizer Annotator

The Whitespace Tokenizer annotator component provides a UIMA annotator implementation that tokenizes text documents using simple whitespace segmentation. During tokenization, the annotator creates token and sentence annotations. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/WhitespaceTokenizer.
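
As a rough illustration of whitespace segmentation, the following minimal sketch computes character offsets for each run of non-whitespace characters. It is not the sandbox annotator's code: the class name WhitespaceTokenizerSketch and its Span type are hypothetical, it does not use the UIMA API, and it leaves out sentence detection. In an annotator, such offsets would typically become the begin and end positions of token annotations.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; names are hypothetical and not taken from the sandbox code. */
final class WhitespaceTokenizerSketch {

    /** Half-open character span [begin, end) of one token. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns one span per maximal run of non-whitespace characters. */
    static List<Span> tokenize(String text) {
        List<Span> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i; // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new Span(start, i)); // the token ends before this whitespace
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new Span(start, text.length())); // token runs to the end of the text
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "UIMA  analyzes text.";
        for (Span s : tokenize(text)) {
            System.out.println(s.begin + "-" + s.end + ": " + text.substring(s.begin, s.end));
        }
    }
}
```

Punctuation stays attached to the neighbouring word in this sketch, which is the usual consequence of splitting on whitespace alone.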

Snowball Annotator

The Snowball annotator is a UIMA annotator component that wraps the Snowball stemming algorithm. The annotator iterates over the token annotations available in the CAS and, for each token, creates a feature containing its stem. The stemming algorithm is available for several languages. For details about Snowball, see http://snowball.tartarus.org/. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/SnowballAnnotator.

Note: the implementation of the Snowball stemming algorithm used here is licensed under the BSD license.

Regular Expression Annotator

The Regular Expression Annotator (RegexAnnotator) is an Apache UIMA analysis engine that detects entities such as email addresses, URLs, phone numbers, zip codes, or any other entity based on regular expressions and concepts. For each detected entity, an annotation can be created or an existing annotation can be updated with feature values. Click here to access the user documentation for the Regular Expression Annotator. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/RegularExpressionAnnotator.
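
To make the idea of regex-based entity detection concrete, here is a minimal sketch using only java.util.regex. It does not use the RegexAnnotator's own rule or concept format; the class name RegexEntitySketch and the deliberately simple email pattern are assumptions made for illustration. Each match yields a pair of offsets that an annotator could turn into an annotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; this is not the RegexAnnotator's rule format or API. */
final class RegexEntitySketch {

    // Deliberately simple, hypothetical pattern; real email matching is more involved.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns {begin, end} offset pairs for every match of the pattern in the text. */
    static List<int[]> findSpans(Pattern pattern, String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Mail dev@uima.example or user@example.org today.";
        for (int[] s : findSpans(EMAIL, text)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```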

PEAR Packaging ANT Task

The PEAR packaging ANT task component is a project to create UIMA PEAR packages automatically during a component build using a custom Apache ANT task. With this task, users are able to build their components from the source and then package them automatically as a UIMA PEAR package. Click here to access the user documentation for the PEAR packaging ANT task. The Java source of the PEAR packaging task can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/PearPackagingAntTask.

PEAR Packaging Maven Plugin

The PEAR packaging Maven plugin component is a project to create UIMA PEAR packages automatically during a component build using a custom Maven plugin. With this plugin, users are able to build their components from the source and then package them automatically as a UIMA PEAR package. Click here to access the user documentation of the PEAR packaging Maven plugin. The Java source of the PEAR packaging Maven plugin can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/PearPackagingMavenPlugin.

Dictionary Annotator

The Dictionary Annotator is an Apache UIMA analysis engine that creates annotations based on word lists that are compiled to simple dictionaries. Both the output annotation type for the created annotations and the input annotation type on which the dictionary lookup is performed can be specified individually. Click here to access the user documentation for the Dictionary Annotator. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator.

Feature Structure Variables

The Feature Structure variables project allows you to create named feature structure instances. It further allows you to refer to individual feature structures or annotations across annotators, without creating a special index. The Java source of the project can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/FsVariables.

Tagger Annotator

The Tagger Annotator component implements a Hidden Markov Model (HMM) tagger. The tagger assumes that sentences and tokens have already been annotated in the CAS with sentence and token annotations. It then iterates over sentences and tokens in turn to accumulate a list of words, and invokes the tagger on this list. The HMM tagger employs the Viterbi algorithm to calculate the most probable tag sequence. For each token, it updates the posTag field with the part-of-speech tag. Model training happens outside of UIMA; the tagger only receives statistical information from a model file, which is passed to the tagger along with some further parameters through a properties file. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/Tagger.
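
The Viterbi step mentioned above can be sketched generically as follows. This is a textbook first-order formulation under stated assumptions, not the sandbox tagger's implementation: the class name ViterbiSketch, the array-based model layout, and the toy probabilities are all hypothetical, and the sandbox tagger's model file format and HMM order may differ. Scores are kept in log space to avoid underflow on long sentences.

```java
import java.util.Arrays;

/** Textbook first-order Viterbi decoding; illustrative, not the sandbox tagger's code. */
final class ViterbiSketch {

    /**
     * @param start start[s]    = P(first tag is s)
     * @param trans trans[p][s] = P(tag s | previous tag p)
     * @param emit  emit[s][o]  = P(word o | tag s)
     * @param obs   word indices, one per token of the sentence
     * @return the most probable tag index for each token
     */
    static int[] decode(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-probability of a path ending in tag s at t
        int[][] back = new int[n][k];        // predecessor tag on that best path
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double cand = score[t - 1][p] + Math.log(trans[p][s]);
                    if (cand > best) {
                        best = cand;
                        bestPrev = p;
                    }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = bestPrev;
            }
        }
        // Pick the best final tag, then follow the back-pointers to recover the path.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model (made-up numbers): tags 0 = NOUN, 1 = VERB; words 0 = "dogs", 1 = "bark".
        double[] start = {0.8, 0.2};
        double[][] trans = {{0.3, 0.7}, {0.6, 0.4}};
        double[][] emit = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(Arrays.toString(decode(start, trans, emit, new int[] {0, 1})));
    }
}
```

With the toy numbers above, the sentence "dogs bark" decodes to NOUN followed by VERB.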

BSF Annotator

The Bean Scripting Framework (BSF) Annotator is an Apache UIMA analysis engine that provides a link between the UIMA framework and the scripting languages that are supported by Apache BSF (http://jakarta.apache.org/bsf). The current implementation comes with examples in Beanshell (http://www.beanshell.org) and Rhino JavaScript (http://www.mozilla.org/rhino). Simple tests have also been conducted successfully with Jython (http://jython.sourceforge.net/Project/index.html) and JRuby (http://jruby.codehaus.org). The annotator takes as a parameter the source file containing the script. The script is expected to implement the initialize and process functions of the analysis engine. Using a scripting language can be very handy for quick prototyping, pre/post-processing, CAS cleaning tasks, or type system conversion/adaptation. The Java source of the annotator can be accessed from the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/BSFAnnotator.

Tika Annotator

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to generate annotations representing the original markup of a document and to extract its text and metadata. It consists of three resources:

FileSystemCollectionReader
similar to the one in the UIMA examples, but uses Tika to extract the text from binary documents and generates annotations to represent the markup
MarkupAnnotator
takes the original content from a view and generates a new view containing the extracted text with markup annotations
TikaWrapper
utility class that populates a CAS from a binary document; used by the FileSystemCollectionReader

Lucene CAS indexer (Lucas)

The Lucene CAS indexer (Lucas) is a UIMA CAS consumer that stores CAS data in a Lucene index. The consumer transforms annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document. Token streams can be processed further by token filters. Lucas comes with a set of its own token filters and integrations for some Lucene token filters. Furthermore, you can deploy your own token filters. The mapping between UIMA annotations and Lucene tokens, as well as the token filtering, is configured by an XML mapping file. The Java source of the consumer can be accessed in the SVN repository.

Simple Server (UIMA REST Service)

The UIMA Simple Server makes results of UIMA processing available in a simple, XML-based format. The intended use of the Simple Server is to provide UIMA analysis as a REST service. The Simple Server is implemented as a Java Servlet and can be deployed into any Servlet container (such as Apache Tomcat or Jetty). Click here to access the user documentation of the Simple Server.

The Java source of the Simple Server can be accessed from the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/SimpleServer.

OpenCalais Annotator

The OpenCalais Annotator component wraps the OpenCalais web service and makes the OpenCalais analysis results available in UIMA. OpenCalais can detect a large variety of entities, facts, and events, such as Persons, Companies, Acquisitions, and Mergers. For details about the OpenCalais analytics and the license to use the service, please refer to the OpenCalais website. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/OpenCalaisAnnotator.

Concept Mapper Annotator

ConceptMapper is a powerful, highly configurable, UIMA-based dictionary annotator.

Numerous parameters can be used to specify various aspects of the lookup algorithm, input processing, and output options. The dictionary structure is flexible, allowing any number of synonyms to be associated with an entry, and any number of attributes to be associated with entries or synonyms.

Lookup and matching against dictionary entries can be performed against contiguous or non-contiguous blocks of text, and token-order-independent lookup is also supported (for example, the tokens "A" "B" would be considered a match against the dictionary entry "B" "A").
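
One simple way to picture token-order-independent matching is to compare sorted copies of the token sequences. The sketch below does exactly that; it is only an illustration of the idea and makes no claim about how ConceptMapper implements its lookup. The class name OrderIndependentMatchSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only; ConceptMapper's actual lookup algorithm may work differently. */
final class OrderIndependentMatchSketch {

    /** Sorted copy of the tokens, used as an order-insensitive key. */
    static String[] canonical(List<String> tokens) {
        String[] copy = tokens.toArray(new String[0]);
        Arrays.sort(copy);
        return copy;
    }

    /** True when both sequences contain the same tokens the same number of times, in any order. */
    static boolean matchesIgnoringOrder(List<String> textTokens, List<String> entryTokens) {
        return Arrays.equals(canonical(textTokens), canonical(entryTokens));
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("A", "B");
        List<String> entry = Arrays.asList("B", "A");
        System.out.println(matchesIgnoringOrder(text, entry));
    }
}
```

Sorting treats the tokens as a multiset, so a repeated token in the text only matches an entry that repeats it as well.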

Additionally, ConceptMapper can be configured to use any tokenizer annotator, so that the dictionary is tokenized in the same way as the input text.