org.apache.lucene.analysis
Class CharTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.CharTokenizer
- Direct Known Subclasses:
- LetterTokenizer, RussianLetterTokenizer, WhitespaceTokenizer
public abstract class CharTokenizer
- extends Tokenizer
An abstract base class for simple, character-oriented tokenizers.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Method Summary
protected abstract boolean isTokenChar(char c)
- Returns true iff a character should be included in a token.
Token next(Token reusableToken)
- Returns the next token in the stream, or null at EOS.
protected char normalize(char c)
- Called on each token character to normalize it before it is added to the token.
void reset(Reader input)
- Expert: Reset the tokenizer to a new reader.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CharTokenizer
public CharTokenizer(Reader input)
isTokenChar
protected abstract boolean isTokenChar(char c)
- Returns true iff a character should be included in a token. This
tokenizer emits each maximal run of adjacent characters that satisfy
this predicate as one token. Characters for which it returns false
define token boundaries and are not included in any token.
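The boundary rule described above can be illustrated without Lucene on the classpath. The following is a minimal sketch, not the library's implementation: PredicateSplitter and LetterSplitterDemo are hypothetical names, and the sketch works on a String rather than a Reader. It only shows how a single isTokenChar predicate determines both token content and token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharTokenizer subclass; no Lucene classes are used.
abstract class PredicateSplitter {
    // Plays the role of CharTokenizer.isTokenChar(char).
    protected abstract boolean isTokenChar(char c);

    // Collects each maximal run of characters accepted by isTokenChar.
    List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);               // character belongs to the current token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // rejected character ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush the last token at end of input
        }
        return tokens;
    }
}

class LetterSplitterDemo {
    // Splits on anything that is not a letter, similar in spirit to LetterTokenizer.
    static List<String> letters(String text) {
        PredicateSplitter splitter = new PredicateSplitter() {
            @Override
            protected boolean isTokenChar(char c) {
                return Character.isLetter(c);
            }
        };
        return splitter.split(text);
    }

    public static void main(String[] args) {
        System.out.println(letters("Hello, World 42!")); // prints [Hello, World]
    }
}
```

Note that the digits and punctuation never appear in the output: they only act as separators.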
normalize
protected char normalize(char c)
- Called on each token character to normalize it before it is added to the
token. The default implementation returns the character unchanged.
Subclasses may override this to, for example, lowercase tokens.
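As a rough illustration of the normalize hook, the sketch below models both extension points in a small standalone class. NormalizingSplitter and LowerCasingSplitter are hypothetical names introduced here, and the whitespace-based isTokenChar is an assumption chosen for the example. The point is only that normalization is applied per character as the token is built, not to the finished token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two CharTokenizer hooks; no Lucene classes are used.
class NormalizingSplitter {
    // Assumption for this sketch: any non-whitespace character is a token character.
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }

    // Default behaviour: pass the character through unchanged.
    protected char normalize(char c) {
        return c;
    }

    final List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(normalize(c));    // normalize before adding to the token
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Overrides only the normalization step to lowercase every token character.
class LowerCasingSplitter extends NormalizingSplitter {
    @Override
    protected char normalize(char c) {
        return Character.toLowerCase(c);
    }
}
```

With these definitions, `new LowerCasingSplitter().split("Quick  BROWN Fox")` yields the tokens quick, brown and fox, while the base class leaves the case untouched.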
next
public final Token next(Token reusableToken)
throws IOException
- Description copied from class: TokenStream
- Returns the next token in the stream, or null at EOS.
When possible, the input Token should be used as the
returned Token (this gives fastest tokenization
performance), but this is not required and a new Token
may be returned. Callers may re-use a single Token
instance for successive calls to this method.
This implicitly defines a "contract" between
consumers (callers of this method) and
producers (implementations of this method
that are the source for tokens):
- A consumer must fully consume the previously
returned Token before calling this method again.
- A producer must call Token.clear() before setting the fields in it and returning it.
Also, the producer must make no assumptions about a
Token after it has been returned: the caller may
arbitrarily change it. If the producer needs to hold
onto the token for subsequent calls, it must clone()
it before storing it.
Note that a TokenFilter is considered a consumer.
- Overrides:
next in class TokenStream
- Parameters:
reusableToken - a Token that may or may not be used as the returned
Token; this parameter should never be null (the callee is not required
to check for null before using it, but it is a good idea to assert
that it is not null).
- Returns:
- next token in the stream or null if end-of-stream was hit
- Throws:
IOException
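The consumer/producer contract for next(Token) can be sketched with a miniature stand-in. The classes ReusableTok, WhitespaceProducer and ReuseContractDemo below are hypothetical and are not Lucene's Token or TokenStream; they only mirror the rules stated above. The producer clears the reusable instance before filling it, and the consumer finishes with the returned token before asking for the next one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a reusable token; not Lucene's Token class.
class ReusableTok {
    final StringBuilder text = new StringBuilder();
    int start;
    int end;

    void clear() {
        text.setLength(0);
        start = 0;
        end = 0;
    }
}

// Hypothetical producer that follows the contract described above.
class WhitespaceProducer {
    private final String input;
    private int pos = 0;

    WhitespaceProducer(String input) {
        this.input = input;
    }

    // Returns the filled token, or null at end of stream.
    ReusableTok next(ReusableTok reusable) {
        assert reusable != null;
        // Skip separator characters.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return null;
        }
        reusable.clear();                 // producer rule: clear before setting fields
        reusable.start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
            reusable.text.append(input.charAt(pos++));
        }
        reusable.end = pos;
        return reusable;                  // reuse the caller's instance when possible
    }
}

class ReuseContractDemo {
    static List<String> consume(String input) {
        WhitespaceProducer producer = new WhitespaceProducer(input);
        ReusableTok reusable = new ReusableTok();   // one instance for all calls
        List<String> out = new ArrayList<>();
        for (ReusableTok t = producer.next(reusable); t != null; t = producer.next(reusable)) {
            // Consumer rule: copy out what is needed before calling next again,
            // because the same instance will be overwritten.
            out.add(t.text + "@" + t.start + "-" + t.end);
        }
        return out;
    }
}
```

The loop reads from the returned reference t rather than from reusable, since a producer is allowed to return a different instance.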
reset
public void reset(Reader input)
throws IOException
- Description copied from class: Tokenizer
- Expert: Reset the tokenizer to a new reader. Typically, an
analyzer (in its reusableTokenStream method) will use
this to re-use a previously created tokenizer.
- Overrides:
reset in class Tokenizer
- Throws:
IOException
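The reuse pattern that reset(Reader) enables might look roughly like the following. This is a sketch under stated assumptions: ResettableSplitter and ResetDemo are hypothetical names, the splitter breaks on whitespace, and no Lucene classes are involved. It only shows one instance being pointed at a second reader instead of a new instance being allocated.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader-backed splitter; it only models the reset(Reader) idea.
class ResettableSplitter {
    private Reader input;

    ResettableSplitter(Reader input) {
        this.input = input;
    }

    // Points the existing instance at a new reader.
    void reset(Reader input) {
        this.input = input;
    }

    // Reads the current reader to the end and returns whitespace-separated tokens.
    List<String> readAll() throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (!Character.isWhitespace((char) c)) {
                current.append((char) c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

class ResetDemo {
    // Tokenizes two inputs with a single splitter instance.
    static List<List<String>> run() {
        try {
            ResettableSplitter splitter = new ResettableSplitter(new StringReader("first doc"));
            List<List<String>> results = new ArrayList<>();
            results.add(splitter.readAll());
            splitter.reset(new StringReader("second doc here"));  // reuse, do not reallocate
            results.add(splitter.readAll());
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```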
Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.