org.apache.lucene.analysis.cjk
Class CJKTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.cjk.CJKTokenizer
public final class CJKTokenizer
- extends Tokenizer
CJKTokenizer was modified from StopTokenizer, which does a decent job for
most European languages. It tokenizes double-byte characters differently:
a token is returned for every two adjacent characters, with overlapping matches.
Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
Zero-length tokens ("") also need to be filtered out.
Digits: a digit, '+', or '#' is tokenized as a letter.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation,
please search Google.
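The overlapping two-character segmentation described above can be sketched in plain Java. This is an illustration only, not the class's actual source: the class name BigramSketch, its helper methods, the rule that any non-ASCII letter counts as a double-byte character, and the choice to emit a lone double-byte character as a one-character token are all assumptions made for the example. Surrogate pairs are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Per the description above, digits, '+' and '#' are treated like letters.
    // Assumption: only ASCII characters qualify as "word" characters here.
    public static boolean isAsciiWordChar(char c) {
        return c < 128 && (Character.isLetterOrDigit(c) || c == '+' || c == '#');
    }

    // Assumption: any non-ASCII letter is handled as a double-byte (CJK) character.
    public static boolean isCjkChar(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    public static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isAsciiWordChar(c)) {
                // Consume a whole run of ASCII word characters as one token.
                int start = i;
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isCjkChar(c)) {
                // Consume a run of double-byte characters, then emit overlapping pairs.
                int start = i;
                while (i < text.length() && isCjkChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Assumption: a lone character becomes a single-character token.
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                // Whitespace and punctuation separate tokens, so no zero-length
                // token is ever produced.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The four escaped characters stand in for C1C2C3C4 in the example above.
        // The result holds "java" followed by three overlapping two-character tokens.
        System.out.println(segment("java \u6771\u4EAC\u5927\u5B66"));
    }
}
```

Because separators are skipped rather than emitted, this sketch needs no extra pass to filter empty tokens; a run of n double-byte characters (n of at least 2) yields n - 1 tokens.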
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Constructor Summary
CJKTokenizer(Reader in)
Construct a token stream processing the given input.
Method Summary
Token next(Token reusableToken)
Returns the next token in the stream, or null at EOS.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CJKTokenizer
public CJKTokenizer(Reader in)
- Construct a token stream processing the given input.
- Parameters:
in
- I/O reader
next
public final Token next(Token reusableToken)
throws IOException
- Returns the next token in the stream, or null at EOS.
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for details.
- Overrides:
next
in class TokenStream
- Parameters:
reusableToken
- a reusable token
- Returns:
- the next Token, or null at EOS
- Throws:
IOException
- if a read error occurs in the underlying Reader
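The reference to Character.UnicodeBlock above suggests that characters are classified by Unicode block. A minimal sketch of such a classification with the standard JDK API follows. The class name BlockSketch, the classify helper, and the particular set of blocks treated as CJK are assumptions for illustration; this page does not state which blocks the tokenizer itself checks.

```java
public class BlockSketch {

    // Hypothetical helper: returns a coarse category for one character.
    public static String classify(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == Character.UnicodeBlock.BASIC_LATIN) {
            return "latin";
        }
        // Assumption: these four blocks are enough to illustrate CJK detection.
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES) {
            return "cjk";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));       // Basic Latin letter
        System.out.println(classify('\u6771'));  // CJK unified ideograph
        System.out.println(classify('\u3042'));  // Hiragana letter
    }
}
```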
Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.