org.apache.poi.hwpf.extractor
Class WordExtractor

java.lang.Object
  extended by org.apache.poi.POITextExtractor
      extended by org.apache.poi.POIOLE2TextExtractor
          extended by org.apache.poi.hwpf.extractor.WordExtractor

public class WordExtractor
extends POIOLE2TextExtractor

Extracts the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to do otherwise.

Author:
Nick Burch (nick at torchbox dot com)

Field Summary
 
Fields inherited from class org.apache.poi.POITextExtractor
document
 
Constructor Summary
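WordExtractor(HWPFDocument doc)
          Create a new Word Extractor from an HWPFDocument
WordExtractor(java.io.InputStream is)
          Create a new Word Extractor from an InputStream containing the Word file
WordExtractor(POIFSFileSystem fs)
          Create a new Word Extractor from a POIFSFileSystem containing the Word file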
 
Method Summary
 java.lang.String[] getParagraphText()
          Get the text from the Word file as an array, with one String per paragraph.
 java.lang.String getText()
          Grab the text, based on the paragraphs.
 java.lang.String getTextFromPieces()
          Grab the text out of the text pieces.
 static void main(java.lang.String[] args)
          Command-line extractor, so that the class can be run directly.
 
Methods inherited from class org.apache.poi.POIOLE2TextExtractor
getDocSummaryInformation, getSummaryInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordExtractor

public WordExtractor(java.io.InputStream is)
              throws java.io.IOException
Create a new Word Extractor

Parameters:
is - InputStream containing the Word file
Throws:
java.io.IOException
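
This page does not say whether the extractor closes the stream it is given, so the following minimal sketch assumes the caller owns it. The StreamExtraction class, its StreamReader callback, and the extract method are hypothetical names introduced for illustration; they are not part of POI. The sketch opens the file, hands the stream to the callback, and closes it with try-with-resources even if the callback throws.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamExtraction {
    /** Hypothetical callback: reads from the open stream and returns text. */
    @FunctionalInterface
    public interface StreamReader {
        String read(InputStream in) throws IOException;
    }

    /**
     * Opens the file, passes the stream to the callback, and always closes
     * the stream afterwards (assumption: the caller owns the stream).
     */
    public static String extract(Path file, StreamReader reader) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return reader.read(in);
        }
    }
}
```

With POI on the classpath, the callback would presumably be something like `in -> new WordExtractor(in).getText()`, so that the text is read before the stream is closed.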

WordExtractor

public WordExtractor(POIFSFileSystem fs)
              throws java.io.IOException
Create a new Word Extractor

Parameters:
fs - POIFSFileSystem containing the Word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(HWPFDocument doc)
              throws java.io.IOException
Create a new Word Extractor

Parameters:
doc - The HWPFDocument to extract from
Throws:
java.io.IOException

Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Command-line extractor, so that the class can be run directly.

Throws:
java.io.IOException

getParagraphText

public java.lang.String[] getParagraphText()
Get the text from the Word file as an array, with one String per paragraph.
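
Because the result is one String per paragraph, callers often want to tidy the array before using it. The following is a minimal sketch of such post-processing. ParagraphCleaner and nonEmptyParagraphs are hypothetical names, not part of POI, and the sketch assumes (without confirmation from this page) that paragraphs may carry trailing line terminators or be empty.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphCleaner {
    /**
     * Strips trailing whitespace (including line terminators) from each
     * paragraph and drops entries that are null or empty afterwards.
     */
    public static List<String> nonEmptyParagraphs(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null) {
                continue;
            }
            // \z anchors at the very end of the input
            String trimmed = paragraph.replaceAll("\\s+\\z", "");
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

In real use the argument would be the array returned by getParagraphText().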


getTextFromPieces

public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. The result may include stray non-text content, but this method still works when the text piece -> paragraph mapping is broken. It is also faster than getText().


getText

public java.lang.String getText()
Grab the text, based on the paragraphs. The result should not include stray non-text content, but this method is slightly slower than getTextFromPieces().

Specified by:
getText in class POITextExtractor
Returns:
All the text from the document
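
The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: prefer the paragraph-based text, and use the text pieces only when the paragraph route fails. The following is a minimal sketch of that idea. ExtractionFallback and extractWithFallback are hypothetical names, not part of POI, and the sketch assumes that a broken text piece -> paragraph mapping shows up as a runtime exception or as empty text, which this page does not confirm.

```java
import java.util.function.Supplier;

public class ExtractionFallback {
    /**
     * Returns the paragraph-based text when it is available and non-empty;
     * otherwise falls back to the piece-based text.
     */
    public static String extractWithFallback(Supplier<String> byParagraphs,
                                             Supplier<String> byPieces) {
        try {
            String text = byParagraphs.get();
            if (text != null && !text.isEmpty()) {
                return text;
            }
        } catch (RuntimeException e) {
            // Assumption: a broken mapping surfaces as a runtime exception.
        }
        return byPieces.get();
    }
}
```

With POI on the classpath, the call would presumably look like `extractWithFallback(extractor::getText, extractor::getTextFromPieces)`. Passing suppliers keeps the faster but noisier method from running unless it is needed.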


Copyright 2008 The Apache Software Foundation or its licensors, as applicable.