org.apache.poi.hwpf.extractor
Class WordExtractor

java.lang.Object
  extended by org.apache.poi.POITextExtractor
      extended by org.apache.poi.POIOLE2TextExtractor
          extended by org.apache.poi.hwpf.extractor.WordExtractor

public final class WordExtractor
extends POIOLE2TextExtractor

Extracts the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to use one of the other methods.
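
A minimal usage sketch, assuming the POI jars that provide the HWPF classes are on the classpath. The class name WordTextDump and the helper extractText are hypothetical names used only for illustration; the WordExtractor(java.io.InputStream) constructor and getText() are the ones listed on this page.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper class for illustration; not part of POI.
final class WordTextDump {

    // Reads a Word document from the stream and returns all of its text.
    static String extractText(InputStream in) throws IOException {
        WordExtractor extractor = new WordExtractor(in);
        return extractor.getText();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: WordTextDump <file.doc>");
            return;
        }
        // The stream is closed once the text has been returned.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Swapping getText() for getParagraphText() in the helper would return one String per paragraph instead of a single String.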

Author:
Nick Burch

Field Summary
 
Fields inherited from class org.apache.poi.POITextExtractor
document
 
Constructor Summary
WordExtractor(DirectoryNode dir)
           
WordExtractor(DirectoryNode dir, POIFSFileSystem fs)
          Deprecated. Use WordExtractor(DirectoryNode) instead
WordExtractor(HWPFDocument doc)
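          Creates a new WordExtractor for an already loaded HWPFDocument.
WordExtractor(java.io.InputStream is)
          Creates a new WordExtractor from an InputStream containing the Word file.
WordExtractor(POIFSFileSystem fs)
          Creates a new WordExtractor from a POIFSFileSystem containing the Word file.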
 
Method Summary
 java.lang.String[] getCommentsText()
           
 java.lang.String[] getEndnoteText()
           
 java.lang.String getFooterText()
          Deprecated. 
 java.lang.String[] getFootnoteText()
           
 java.lang.String getHeaderText()
          Deprecated. 
 java.lang.String[] getMainTextboxText()
           
 java.lang.String[] getParagraphText()
          Gets the text from the Word file as an array with one String per paragraph.
protected static java.lang.String[] getParagraphText(Range r)
           
 java.lang.String getText()
          Grab the text, based on the WordToTextConverter.
 java.lang.String getTextFromPieces()
          Grab the text out of the text pieces.
static void main(java.lang.String[] args)
          Command line extractor, so people will stop moaning that they can't just run this.
static java.lang.String stripFields(java.lang.String text)
          Removes any fields (e.g. macros, page markers, etc.) from the string.
 
Methods inherited from class org.apache.poi.POIOLE2TextExtractor
getDocSummaryInformation, getFileSystem, getMetadataTextExtractor, getRoot, getSummaryInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordExtractor

public WordExtractor(java.io.InputStream is)
              throws java.io.IOException
Creates a new WordExtractor from an InputStream.

Parameters:
is - InputStream containing the Word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(POIFSFileSystem fs)
              throws java.io.IOException
Creates a new WordExtractor from a POIFSFileSystem.

Parameters:
fs - POIFSFileSystem containing the Word file
Throws:
java.io.IOException

WordExtractor

@Deprecated
public WordExtractor(DirectoryNode dir,
                                POIFSFileSystem fs)
              throws java.io.IOException
Deprecated. Use WordExtractor(DirectoryNode) instead

Throws:
java.io.IOException

WordExtractor

public WordExtractor(DirectoryNode dir)
              throws java.io.IOException
Throws:
java.io.IOException

WordExtractor

public WordExtractor(HWPFDocument doc)
Creates a new WordExtractor for an already loaded HWPFDocument.

Parameters:
doc - The HWPFDocument to extract from
Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Command line extractor, so people will stop moaning that they can't just run this.

Throws:
java.io.IOException

getParagraphText

public java.lang.String[] getParagraphText()
Gets the text from the Word file as an array with one String per paragraph.
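
The returned array often needs light post-processing before display. The sketch below assumes that paragraph strings may end with a line terminator such as "\r" or "\r\n" (an assumption, not something this page states) and that empty paragraphs are unwanted. ParagraphCleaner and clean are hypothetical names, not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of POI.
final class ParagraphCleaner {

    // Strips trailing CR/LF characters from each paragraph and drops
    // paragraphs that are empty once those terminators are removed.
    static List<String> clean(String[] paragraphs) {
        List<String> result = new ArrayList<String>();
        for (String p : paragraphs) {
            int end = p.length();
            while (end > 0 && (p.charAt(end - 1) == '\r' || p.charAt(end - 1) == '\n')) {
                end--;
            }
            String trimmed = p.substring(0, end);
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

With an extractor in hand, this could be called as ParagraphCleaner.clean(extractor.getParagraphText()).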


getFootnoteText

public java.lang.String[] getFootnoteText()

getMainTextboxText

public java.lang.String[] getMainTextboxText()

getEndnoteText

public java.lang.String[] getEndnoteText()

getCommentsText

public java.lang.String[] getCommentsText()

getParagraphText

protected static java.lang.String[] getParagraphText(Range r)

getHeaderText

@Deprecated
public java.lang.String getHeaderText()
Deprecated. 

Grab the text from the headers


getFooterText

@Deprecated
public java.lang.String getFooterText()
Deprecated. 

Grab the text from the footers


getTextFromPieces

public java.lang.String getTextFromPieces()
Grabs the text directly out of the text pieces. The result may include stray non-text content, but this method still works when the text piece -> paragraph mapping is broken. It is also fast.


getText

public java.lang.String getText()
Grabs the text using the WordToTextConverter. The result should not include stray non-text content, but this method is slower than getTextFromPieces().

Specified by:
getText in class POITextExtractor
Returns:
All the text from the document
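
One way to combine getText() and getTextFromPieces() is to prefer the cleaner result and fall back to the raw one. The sketch below is a generic fallback helper under the assumption that a failing extraction surfaces as a RuntimeException; this page does not say how getText() behaves when the text piece -> paragraph mapping is broken. TextFallback and firstSuccessful are hypothetical names, not part of POI.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of POI.
final class TextFallback {

    // Returns the primary result, or the fallback result if the primary
    // supplier throws a RuntimeException.
    static String firstSuccessful(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }
}
```

With an extractor in hand, this could be called as TextFallback.firstSuccessful(extractor::getText, extractor::getTextFromPieces). Text obtained through the fallback may still contain stray content, so passing it through stripFields is worth considering.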

stripFields

public static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers, etc.) from the string.
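
A short sketch of calling stripFields on raw text, assuming the POI jars are on the classpath. The sample string assumes that field markup in raw Word text is delimited by the control characters U+0013, U+0014 and U+0015; that detail comes from the Word binary format and is not stated on this page. StripFieldsDemo and clean are hypothetical names.

```java
import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical demo class for illustration; not part of POI.
final class StripFieldsDemo {

    // Delegates to the static helper documented above.
    static String clean(String raw) {
        return WordExtractor.stripFields(raw);
    }

    public static void main(String[] args) {
        // Assumed shape of raw text containing a PAGE field (see lead-in).
        String raw = "Page \u0013 PAGE \u00141\u0015 of the report";
        System.out.println(clean(raw));
    }
}
```

Because the method is static, it can be applied to any String, including text returned by getTextFromPieces().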



Copyright 2012 The Apache Software Foundation or its licensors, as applicable.