org.apache.lucene.ant
Class HtmlDocument

java.lang.Object
  extended by org.apache.lucene.ant.HtmlDocument

public class HtmlDocument
extends Object

The HtmlDocument class creates a Lucene Document from an HTML document.

It does this by using the JTidy package. It can take input from a File or an InputStream.

Author:
Erik Hatcher

Constructor Summary
HtmlDocument(File file)
          Constructs an HtmlDocument from a File.
HtmlDocument(InputStream is)
          Constructs an HtmlDocument from an InputStream.
 
Method Summary
static Document Document(File file)
          Creates a Lucene Document from a File.
 String getBody()
          Gets the bodyText attribute of the HtmlDocument object.
static Document getDocument(InputStream is)
          Creates a Lucene Document from an InputStream.
 String getTitle()
          Gets the title attribute of the HtmlDocument object.
static void main(String[] args)
          Runs HtmlDocument on the files specified on the command line.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlDocument

public HtmlDocument(File file)
             throws IOException
Constructs an HtmlDocument from a File.

Parameters:
file - the File containing the HTML to parse
Throws:
IOException - if an I/O exception occurs

HtmlDocument

public HtmlDocument(InputStream is)
Constructs an HtmlDocument from an InputStream.

Parameters:
is - the InputStream containing the HTML

Method Detail

getDocument

public static Document getDocument(InputStream is)
Creates a Lucene Document from an InputStream.

Parameters:
is - the InputStream containing the HTML

Document

public static Document Document(File file)
                         throws IOException
Creates a Lucene Document from a File.

Parameters:
file - the File containing the HTML to parse
Throws:
IOException - if an I/O exception occurs

main

public static void main(String[] args)
                 throws Exception
Runs HtmlDocument on the files specified on the command line.

Parameters:
args - Command line arguments
Throws:
Exception - if an error occurs while processing the files

getTitle

public String getTitle()
Gets the title attribute of the HtmlDocument object.

Returns:
the title value

getBody

public String getBody()
Gets the bodyText attribute of the HtmlDocument object.

Returns:
the bodyText value
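
Example (illustrative sketch, not part of this class):
To make the idea behind getTitle() and getBody() concrete, the sketch below pulls the title text and the body text out of a DOM tree read from an InputStream. It is not the HtmlDocument implementation. It swaps in the JDK's own XML parser for JTidy so that it runs without extra JARs, which means it only accepts well-formed XHTML, whereas JTidy is designed to clean up ordinary HTML. The names TitleBodySketch, titleAndBody, and firstElementText are hypothetical. Note that Document here is org.w3c.dom.Document, not the Lucene Document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch: a JDK-only stand-in for title/body extraction.
// Assumption: the input is well-formed XHTML (no JTidy clean-up step).
public class TitleBodySketch {

    // Returns the trimmed text of the first element named `tag`,
    // or an empty string if the document has no such element.
    static String firstElementText(Document dom, String tag) {
        NodeList nodes = dom.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }

    // Parses the stream into a DOM tree and returns { title, body text }.
    static String[] titleAndBody(InputStream in) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return new String[] {
                firstElementText(dom, "title"),
                firstElementText(dom, "body")
            };
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Unlike JTidy, the XML parser rejects markup that is not well-formed.
            throw new IllegalArgumentException("could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Hello</title></head>"
                + "<body><p>First paragraph.</p> <p>Second one.</p></body></html>";
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        String[] parts = titleAndBody(in);
        System.out.println("title: " + parts[0]); // title: Hello
        System.out.println("body: " + parts[1]);  // body: First paragraph. Second one.
    }
}
```

With the real class, the corresponding values would come from getTitle() and getBody() on an HtmlDocument built from the same stream.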


Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.