org.apache.tika.parser.microsoft
Class OfficeParser

java.lang.Object
  extended by org.apache.tika.parser.microsoft.OfficeParser
All Implemented Interfaces:
Parser
Direct Known Subclasses:
ExcelEventParser, ExcelParser, PowerPointParser, WordParser

public abstract class OfficeParser
extends java.lang.Object
implements Parser

Abstract base class for parsers that extract content from Microsoft Office documents.


Constructor Summary
OfficeParser()
           
 
Method Summary
protected abstract  void extractText(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, java.lang.Appendable appendable)
          Extracts the text content from the POIFS filesystem of a Microsoft document.
protected abstract  java.lang.String getContentType()
          The content type of the document being parsed.
 void parse(java.io.InputStream stream, org.xml.sax.ContentHandler handler, Metadata metadata)
          Extracts properties and text from a Microsoft document input stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

OfficeParser

public OfficeParser()
Method Detail

parse

public void parse(java.io.InputStream stream,
                  org.xml.sax.ContentHandler handler,
                  Metadata metadata)
           throws java.io.IOException,
                  org.xml.sax.SAXException,
                  TikaException
Extracts properties and text from a Microsoft document input stream.

Specified by:
parse in interface Parser
Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
Throws:
java.io.IOException - if the document stream could not be read
org.xml.sax.SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
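
Example (sketch):
The handler parameter receives the parsed content as SAX events, so a caller typically supplies a ContentHandler that collects what it needs. The following is a minimal sketch using only JDK SAX classes. TextCollectingHandler and HandlerDemo are hypothetical names that are not part of Tika, and the events are fired by hand here as a stand-in for what a concrete parser might emit; the exact elements a real parser produces are an assumption.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): gathers character data from SAX events.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the buffer that the event describes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: paragraphs arrive as <p> elements; end each with a newline.
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    String getText() {
        return text.toString();
    }
}

// Hypothetical driver: simulates the events for a single paragraph.
// A real caller would instead pass the handler to parse(stream, handler, metadata).
class HandlerDemo {
    static String simulate(String paragraph) {
        TextCollectingHandler handler = new TextCollectingHandler();
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] chars = paragraph.toCharArray();
        try {
            handler.startDocument();
            handler.startElement(xhtml, "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement(xhtml, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }
}
```

With a concrete subclass instance, the same handler would be passed as the second argument of parse, and getText() read after parse returns.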

getContentType

protected abstract java.lang.String getContentType()
The content type of the document being parsed.

Returns:
MIME content type

extractText

protected abstract void extractText(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem,
                                    java.lang.Appendable appendable)
                             throws java.io.IOException,
                                    TikaException
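Extracts the text content from the POIFS filesystem of a Microsoft document.

Parameters:
filesystem - the POIFS filesystem of the document (input)
appendable - receives the extracted text (output)
Throws:
java.io.IOException - if the document could not be read or the text could not be appended
TikaException - if the document could not be parsed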
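
Example (sketch):
Subclasses supply two hooks: getContentType() and extractText(...). The following is a simplified stand-in for that shape, runnable on the JDK alone. SketchExtractor and WordSketchExtractor are hypothetical names, a String stands in for the POIFSFileSystem argument, and the run method is only one plausible way a caller could combine the hooks; this page does not show how parse actually uses them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the abstract extractor shape; NOT the Tika class.
abstract class SketchExtractor {
    // Hook 1: the MIME type the subclass handles.
    protected abstract String getContentType();

    // Hook 2: write extracted text to the Appendable.
    // Assumption: a String replaces the real POIFSFileSystem parameter.
    protected abstract void extractText(String source, Appendable appendable)
            throws IOException;

    // Illustrative template method: labels the extracted text with its type.
    final String run(String source) {
        StringBuilder out = new StringBuilder();
        out.append(getContentType()).append(": ");
        try {
            // StringBuilder implements Appendable, so it can receive the text.
            extractText(source, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}

// Hypothetical subclass: "extracts" by trimming surrounding whitespace.
class WordSketchExtractor extends SketchExtractor {
    @Override
    protected String getContentType() {
        return "application/msword"; // example value only
    }

    @Override
    protected void extractText(String source, Appendable appendable)
            throws IOException {
        appendable.append(source.trim());
    }
}
```

Taking an Appendable rather than returning a String lets the caller decide where the text goes, for example a StringBuilder or a Writer.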


Copyright © 2008 The Apache Software Foundation. All Rights Reserved.