Apache UIMA
UIMA Annotators

Annotators do the real work of extracting structured information from unstructured data. You can write your own annotators, use the annotators available here, or find UIMA annotators on the web, where they are often collected in repositories.

Several annotators are available as part of Apache UIMA; they are bundled into a single downloadable package. The source for the annotators is in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/.

The annotators available here are each described in more detail below.

Whitespace Tokenizer Annotator

The Whitespace Tokenizer Annotator component provides a UIMA annotator implementation that tokenizes text documents using simple whitespace segmentation. During tokenization, the annotator creates token and sentence annotations. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/WhitespaceTokenizer.
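As a rough illustration of whitespace segmentation, the following sketch computes begin/end character offsets for each token in plain Java. It is not the annotator's actual code: the class and method names are hypothetical, it does not use the UIMA API, and it leaves out sentence detection entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, independent of the real WhitespaceTokenizer source.
final class WhitespaceTokenizerSketch {

    /** Returns one {begin, end} pair per token; end is exclusive. */
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        int begin = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && begin < 0) {
                begin = i;                        // a token starts here
            } else if (ws && begin >= 0) {
                spans.add(new int[] {begin, i});  // a token ended just before i
                begin = -1;
            }
        }
        if (begin >= 0) {
            spans.add(new int[] {begin, text.length()}); // token running to end of text
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "UIMA annotators  extract structure.";
        for (int[] s : tokenize(text)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

In a real annotator, each offset pair would presumably become a token annotation in the CAS rather than an int array.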

Snowball Annotator

The Snowball annotator is a UIMA annotator component that wraps the Snowball stemming algorithm. The annotator iterates over the available token annotations in the CAS and, for each token, creates a feature containing the stem. The stemming algorithm is available for several languages. For details about Snowball, please see http://snowball.tartarus.org/. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/SnowballAnnotator.

Note: the implementation of the Snowball stemming algorithm used here is licensed under the BSD license.
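The token-by-token pattern described for the Snowball annotator can be sketched as follows. Everything here is an assumption made for illustration: the Token class, the addStems method, and especially toyStem, which is a crude suffix stripper standing in for a real stemmer and is not the Snowball algorithm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch; a real annotator would read tokens from the CAS
// and delegate to a Snowball stemmer for the document language.
final class StemFeatureSketch {

    /** Minimal stand-in for a token annotation with a stem feature. */
    static final class Token {
        final int begin;
        final int end;
        String stem; // filled in by addStems
        Token(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Visits every token and stores the stem of its covered text. */
    static void addStems(String text, List<Token> tokens, UnaryOperator<String> stemmer) {
        for (Token t : tokens) {
            t.stem = stemmer.apply(text.substring(t.begin, t.end));
        }
    }

    /** Toy suffix stripper, NOT Snowball: drops a trailing "ing" or "s". */
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.length() > 3 && w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        String text = "Annotators running";
        List<Token> tokens = Arrays.asList(new Token(0, 10), new Token(11, 18));
        addStems(text, tokens, StemFeatureSketch::toyStem);
        for (Token t : tokens) {
            System.out.println(text.substring(t.begin, t.end) + " -> " + t.stem);
        }
    }
}
```

Passing the stemmer as a function keeps the iteration logic separate from the choice of stemming algorithm or language.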

Regular Expression Annotator

The Regular Expression Annotator (RegexAnnotator) is an Apache UIMA analysis engine that detects entities such as email addresses, URLs, phone numbers, zip codes, or any other entity based on regular expressions and concepts. For each detected entity, an annotation can be created or an existing annotation can be updated with feature values. See the user documentation for the Regular Expression Annotator for details. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/RegularExpressionAnnotator.
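To make the idea of regex-based entity detection concrete, here is a minimal sketch using java.util.regex. It does not use the RegexAnnotator's concept files or its API; the class name, the output format, and the deliberately simple email pattern are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, independent of the RegexAnnotator implementation.
final class RegexEntitySketch {

    // Simplified email pattern for illustration only; it is not RFC-complete.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns one "Email[begin,end]=text" string per match, in document order. */
    static List<String> findEntities(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            // m.start() and m.end() are the character offsets an annotation would carry
            found.add("Email[" + m.start() + "," + m.end() + "]=" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String e : findEntities("Mail dev@uima.apache.org now")) {
            System.out.println(e);
        }
    }
}
```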

Dictionary Annotator

The Dictionary Annotator is an Apache UIMA analysis engine that creates annotations based on word lists that are compiled into simple dictionaries. Both the output annotation type for the annotations that are created and the input annotation type on which the dictionary lookup is executed can be specified individually. See the user documentation for the Dictionary Annotator for details. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator.
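The word-list lookup can be pictured with the short sketch below. It is only an approximation of the idea: the names are hypothetical, whitespace-separated tokens stand in for the configurable input annotation type, and the case-insensitive matching is an assumption of this example rather than documented behavior of the Dictionary Annotator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, independent of the DictionaryAnnotator implementation.
final class DictionaryLookupSketch {

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    /** "Compiles" a word list into a set of lowercased entries. */
    static Set<String> compile(List<String> words) {
        Set<String> dict = new HashSet<>();
        for (String w : words) {
            dict.add(w.toLowerCase(Locale.ROOT));
        }
        return dict;
    }

    /** Returns {begin, end} offsets of every token found in the dictionary. */
    static List<int[]> lookup(String text, Set<String> dict) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (dict.contains(m.group().toLowerCase(Locale.ROOT))) {
                hits.add(new int[] {m.start(), m.end()});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> cities = compile(Arrays.asList("Paris", "Berlin"));
        String text = "From paris to Rome";
        for (int[] hit : lookup(text, cities)) {
            System.out.println(hit[0] + "-" + hit[1] + ": " + text.substring(hit[0], hit[1]));
        }
    }
}
```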

Tagger Annotator

The Tagger Annotator component implements a Hidden Markov Model (HMM) tagger. The tagger assumes that sentences and tokens have already been annotated in the CAS with sentence and token annotations. It iterates over the sentences and, within each sentence, over the tokens to accumulate a list of words, and then invokes the tagger on this list. The HMM tagger employs the Viterbi algorithm to calculate the most probable tag sequence. For each Token, it updates the posTag field with the part-of-speech tag. Model training happens outside of UIMA; the tagger only receives statistical information from a model file, which is passed to the tagger along with some further parameters through a properties file. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/Tagger.
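For readers unfamiliar with Viterbi decoding, the following is a minimal, generic sketch of the algorithm for a first-order HMM, working in log space. It is not taken from the Tagger source: the class name, the array-based model representation, and the toy probabilities in main are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of first-order Viterbi decoding; not the Tagger's code.
final class ViterbiSketch {

    /**
     * @param start start[s]    = P(first tag is s)
     * @param trans trans[p][s] = P(tag s | previous tag p)
     * @param emit  emit[s][o]  = P(word o | tag s)
     * @param obs   observed word indices, one per token
     * @return the most probable tag index for each token
     */
    static int[] viterbi(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) return new int[0];
        double[][] score = new double[n][k]; // best log-probability of a path ending in tag s at t
        int[][] back = new int[n][k];        // predecessor tag on that best path
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < k; p++) {
                    double cand = score[t - 1][p] + Math.log(trans[p][s]);
                    if (cand > best) { best = cand; arg = p; }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = arg;
            }
        }
        int last = 0; // best final tag
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) last = s;
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]]; // follow back-pointers to the start
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model: tags 0=DET, 1=NOUN, 2=VERB; words 0="the", 1="dog", 2="barks"
        double[] start = {0.8, 0.1, 0.1};
        double[][] trans = {{0.05, 0.9, 0.05}, {0.1, 0.3, 0.6}, {0.5, 0.3, 0.2}};
        double[][] emit = {{0.9, 0.05, 0.05}, {0.05, 0.7, 0.25}, {0.05, 0.15, 0.8}};
        System.out.println(Arrays.toString(viterbi(start, trans, emit, new int[] {0, 1, 2})));
    }
}
```

With the toy numbers above, the decoded sequence is DET, NOUN, VERB. Summing log-probabilities instead of multiplying probabilities avoids numeric underflow on long sentences.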

BSF Annotator

The Bean Scripting Framework (BSF) Annotator is an Apache UIMA analysis engine that provides a link between the UIMA framework and the scripting languages supported by Apache BSF (http://jakarta.apache.org/bsf). The current implementation comes with examples in Beanshell (http://www.beanshell.org) and Rhino Javascript (http://www.mozilla.org/rhino). Simple tests have also been conducted successfully with Jython (http://jython.sourceforge.net/Project/index.html) and JRuby (http://jruby.codehaus.org). The annotator takes the source file containing the script as a parameter. The script is expected to implement the initialize and process functions of the analysis engine. A scripting language can be very handy for quick prototyping, pre- and post-processing, CAS cleaning tasks, or type system conversion and adaptation. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/BSFAnnotator.

OpenCalais Annotator

The OpenCalais Annotator component wraps the OpenCalais web service and makes the OpenCalais analysis results available in UIMA. OpenCalais can detect a large variety of entities, facts, and events, for example Persons, Companies, Acquisitions, and Mergers. For details about the OpenCalais analytics and the license to use the service, please refer to the OpenCalais website. The Java source of the annotator can be accessed in the SVN repository at http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/OpenCalaisAnnotator.

Concept Mapper Annotator

ConceptMapper is a powerful, highly configurable, UIMA-based dictionary annotator.

Numerous parameters can be used to specify various aspects of the lookup algorithm, input processing, and output options. The dictionary structure is flexible, allowing any number of synonyms to be associated with an entry, and any number of attributes to be associated with entries or synonyms.

Additionally, ConceptMapper can be used with any tokenizer, enabling tokenization of the dictionary identically with the input text.
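The flexible dictionary structure described above (entries with any number of synonyms and attributes) might be modeled along the following lines. This is a simplified sketch, not ConceptMapper's data model or dictionary format: the class names are hypothetical, attributes are attached to entries only (not to individual synonyms), and lookup is by exact normalized string rather than by configurable token matching.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; it does not reflect ConceptMapper's real structures.
final class ConceptDictionarySketch {

    /** A dictionary entry: a canonical form plus free-form attributes. */
    static final class Entry {
        final String canonical;
        final Map<String, String> attributes;
        Entry(String canonical, Map<String, String> attributes) {
            this.canonical = canonical;
            this.attributes = attributes;
        }
    }

    // Every surface form (canonical form or synonym) points at its entry.
    private final Map<String, Entry> bySurfaceForm = new HashMap<>();

    void add(String canonical, Map<String, String> attributes, String... synonyms) {
        Entry e = new Entry(canonical, attributes);
        bySurfaceForm.put(normalize(canonical), e);
        for (String s : synonyms) {
            bySurfaceForm.put(normalize(s), e);
        }
    }

    /** Returns the entry for a surface form, or null if it is unknown. */
    Entry lookup(String surfaceForm) {
        return bySurfaceForm.get(normalize(surfaceForm));
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ConceptDictionarySketch dict = new ConceptDictionarySketch();
        dict.add("myocardial infarction",
                Collections.singletonMap("semanticType", "disease"),
                "heart attack", "MI");
        Entry e = dict.lookup("Heart Attack");
        System.out.println(e.canonical + " " + e.attributes);
    }
}
```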