org.apache.poi.hslf.extractor
Class QuickButCruddyTextExtractor

java.lang.Object
  extended by org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor

public final class QuickButCruddyTextExtractor
extends java.lang.Object

This class extracts all the text from a PowerPoint document very quickly, but the result includes text you probably did not want, in a somewhat random order. The class ignores most of the hslf classes and does not use HSLFSlideShow. Instead, it does a very basic scan through the file, grabbing every text record as it goes. It then returns the text either as a single string or as a vector of the individual strings.

Because of how it works, it returns a lot of "crud" text: text from master slides, duplicate text, and some mangled text (PowerPoint files often contain duplicate copies of slide text). It also gives no indication of what each piece of text was associated with.

Almost everyone will want to use PowerPointExtractor instead. Only a very small number of cases (e.g. some performance-sensitive Lucene indexers) would ever want to use this class.

Author:
Nick Burch

Constructor Summary
QuickButCruddyTextExtractor(java.io.InputStream iStream)
          Creates an extractor from a given input stream
QuickButCruddyTextExtractor(POIFSFileSystem poifs)
          Creates an extractor from a POIFS Filesystem
QuickButCruddyTextExtractor(java.lang.String fileName)
          Creates an extractor from a given file name
 
Method Summary
 void close()
          Shuts down the underlying streams
 int findTextRecords(int startPos, java.util.Vector<java.lang.String> textV)
          For the given position, checks whether the record is a text record, then moves on to the next record.
 java.lang.String getTextAsString()
          Fetches all the text of the PowerPoint file, as a single string
 java.util.Vector<java.lang.String> getTextAsVector()
          Fetches all the text of the PowerPoint file, as a vector of strings, one per text record
static void main(java.lang.String[] args)
          Really basic text extractor, that will also return lots of crud text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

QuickButCruddyTextExtractor

public QuickButCruddyTextExtractor(java.lang.String fileName)
                            throws java.io.IOException
Creates an extractor from a given file name

Parameters:
fileName - the name of the PowerPoint file to extract text from
Throws:
java.io.IOException

QuickButCruddyTextExtractor

public QuickButCruddyTextExtractor(java.io.InputStream iStream)
                            throws java.io.IOException
Creates an extractor from a given input stream

Parameters:
iStream - the input stream to read the PowerPoint file from
Throws:
java.io.IOException

QuickButCruddyTextExtractor

public QuickButCruddyTextExtractor(POIFSFileSystem poifs)
                            throws java.io.IOException
Creates an extractor from a POIFS Filesystem

Parameters:
poifs - the POIFS filesystem containing the PowerPoint file
Throws:
java.io.IOException

Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
A really basic text extractor that will also return lots of crud text. Takes a single argument: the file to extract from.

Throws:
java.io.IOException

close

public void close()
           throws java.io.IOException
Shuts down the underlying streams

Throws:
java.io.IOException

getTextAsString

public java.lang.String getTextAsString()
Fetches all the text of the PowerPoint file, as a single string


getTextAsVector

public java.util.Vector<java.lang.String> getTextAsVector()
Fetches all the text of the PowerPoint file, as a vector of strings, one per text record
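
Because the returned vector can contain duplicate and empty entries, callers may want to filter it before use. The following is a minimal sketch of one way to do that, not part of POI: the class name CrudFilterSketch and the method dedupe are hypothetical, and it assumes that trimming whitespace and keeping only the first occurrence of each string is acceptable for the caller.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.Vector;

/** Hypothetical helper, not part of POI: tidies a vector of extracted strings. */
final class CrudFilterSketch {

    /**
     * Returns a new vector with blank entries dropped and duplicates removed.
     * The first occurrence of each string is kept, in its original position.
     */
    static Vector<String> dedupe(Vector<String> raw) {
        // LinkedHashSet remembers insertion order and ignores repeated values.
        Set<String> seen = new LinkedHashSet<>();
        for (String s : raw) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        return new Vector<>(seen);
    }
}
```

A caller could pass the result of getTextAsVector() straight to dedupe. Mangled text is not detected by this sketch; only exact duplicates (after trimming) are removed.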


findTextRecords

public int findTextRecords(int startPos,
                           java.util.Vector<java.lang.String> textV)
For the given position, checks whether the record is a text record and, if so, extracts its text into textV. In either case, returns the position of the next record, or -1 if there are no more.
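
The "return the next position, or -1" contract lends itself to a simple driving loop. The sketch below illustrates that pattern on a synthetic byte array and is not the POI implementation: the class RecordScanSketch and all of its methods are hypothetical. It assumes the usual PowerPoint record layout (an 8-byte little-endian header holding a version nibble, a 16-bit type and a 32-bit body length), that containers are marked by a version nibble of 0x0F, and that type IDs 4000 and 4008 identify the UTF-16LE and single-byte text atoms respectively.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: a simplified stand-in, not the POI implementation. */
final class RecordScanSketch {
    static final int HEADER_SIZE = 8;
    static final int TEXT_CHARS_ATOM = 4000; // assumed: text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // assumed: one byte per character

    /** Examines the record at startPos; returns the next record position, or -1. */
    static int findTextRecords(byte[] data, int startPos, List<String> out) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long next;
        if ((data[startPos] & 0x0F) == 0x0F) {
            // Container record: step inside so its children are scanned next.
            next = startPos + HEADER_SIZE;
        } else {
            int type = buf.getShort(startPos + 2) & 0xFFFF;
            int len = buf.getInt(startPos + 4);
            int body = startPos + HEADER_SIZE;
            next = (long) body + len;
            if (len < 0 || next > data.length) {
                return -1; // corrupt or truncated record: stop scanning
            }
            if (type == TEXT_BYTES_ATOM) {
                out.add(new String(data, body, len, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                out.add(new String(data, body, len, StandardCharsets.UTF_16LE));
            }
        }
        // A further record needs room for at least a full header.
        return next > data.length - HEADER_SIZE ? -1 : (int) next;
    }

    /** Drives findTextRecords from position 0 until it reports -1. */
    static List<String> scanAll(byte[] data) {
        List<String> out = new ArrayList<>();
        int pos = data.length >= HEADER_SIZE ? 0 : -1;
        while (pos != -1) {
            pos = findTextRecords(data, pos, out);
        }
        return out;
    }

    /** Builds one synthetic record: header followed by the body bytes. */
    static byte[] record(int version, int type, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + body.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) version); // low nibble holds the record version
        buf.putShort((short) type);
        buf.putInt(body.length);
        buf.put(body);
        return buf.array();
    }

    /** A container (arbitrary type 1000) wrapping two text atoms. */
    static byte[] sample() {
        byte[] hello = record(0, TEXT_BYTES_ATOM, "Hello".getBytes(StandardCharsets.ISO_8859_1));
        byte[] world = record(0, TEXT_CHARS_ATOM, "World".getBytes(StandardCharsets.UTF_16LE));
        ByteBuffer children = ByteBuffer.allocate(hello.length + world.length);
        children.put(hello).put(world);
        return record(0x0F, 1000, children.array());
    }

    public static void main(String[] args) {
        System.out.println(scanAll(sample()));
    }
}
```

Stepping into containers rather than skipping them is what lets a flat loop reach nested text records without recursion. It also hints at why such a scan picks up text from every part of the file, wanted or not.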



Copyright 2012 The Apache Software Foundation or its licensors, as applicable.