org.apache.lucene.analysis.cjk
Class CJKTokenizer

java.lang.Object
  extended by org.apache.lucene.analysis.TokenStream
      extended by org.apache.lucene.analysis.Tokenizer
          extended by org.apache.lucene.analysis.cjk.CJKTokenizer

public final class CJKTokenizer
extends Tokenizer

CJKTokenizer was modified from StopTokenizer, which does a decent job for most European languages. It tokenizes double-byte characters differently: they are returned as overlapping two-character tokens.
Example: "java C1C2C3C4", where C1 to C4 each stand for one double-byte character, is segmented into: "java" "C1C2" "C2C3" "C3C4". Zero-length tokens ("") also need to be filtered out.
Digits: a digit, '+', or '#' is tokenized as a letter.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation, please search Google.
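
The overlapping two-character segmentation described above can be illustrated with a minimal, self-contained sketch. This is not the Lucene implementation: the class name BigramSketch and its helper methods are hypothetical, any non-ASCII letter is assumed to count as a "double-byte" character, and a wide character standing alone is emitted as a one-character token. Lowercasing, offsets, and surrogate pairs are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.apache.lucene.analysis.cjk.
class BigramSketch {

    // ASCII letters, digits, '_', '+' and '#' are grouped into one word token.
    static boolean isWordChar(char c) {
        return c < 128 && (Character.isLetterOrDigit(c) || c == '_' || c == '+' || c == '#');
    }

    // Assumption: every non-ASCII letter is treated as a double-byte character.
    static boolean isWideChar(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isWordChar(c)) {
                // Consume the whole run of word characters as a single token.
                int start = i;
                while (i < text.length() && isWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isWideChar(c)) {
                // Consume the run of wide characters, then emit overlapping pairs.
                int start = i;
                while (i < text.length() && isWideChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                // Whitespace and punctuation separate tokens and are dropped,
                // so no zero-length token is ever produced.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters after a Latin word: one word token, three bigrams.
        System.out.println(segment("java \u4e00\u4e8c\u4e09\u56db"));
    }
}
```

In this sketch, a run of n wide characters (n > 1) yields n - 1 tokens, which matches the "C1C2" "C2C3" "C3C4" pattern in the example above.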


Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
CJKTokenizer(Reader in)
          Construct a token stream processing the given input.
 
Method Summary
 Token next(Token reusableToken)
          Returns the next token in the stream, or null at EOS.
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
next, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CJKTokenizer

public CJKTokenizer(Reader in)
Construct a token stream processing the given input.

Parameters:
in - the Reader supplying the text to tokenize

Method Detail

next

public final Token next(Token reusableToken)
                 throws IOException
Returns the next token in the stream, or null at end of stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Character.UnicodeBlock.

Overrides:
next in class TokenStream
Parameters:
reusableToken - a reusable token
Returns:
the next Token, or null at end of stream
Throws:
IOException - if a read error occurs in the underlying input Reader


Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.