org.apache.lucene.analysis.cn
Class ChineseTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.cn.ChineseTokenizer
public final class ChineseTokenizer
- extends Tokenizer
Title: ChineseTokenizer
Description: Extract tokens from the Stream using Character.getType()
Rule: A Chinese character as a single token
Copyright: Copyright (c) 2001
Company:
The difference between the ChineseTokenizer and the
CJKTokenizer (id=23545) is that they have different
token parsing logic.
For example, if the Chinese text "C1C2C3C4" is to be
indexed, the tokens returned from the ChineseTokenizer
are C1, C2, C3, C4, whereas the tokens returned from the
CJKTokenizer are C1C2, C2C3, C3C4. Therefore the index
created by the CJKTokenizer is much larger.
The trade-off is that searches for C1, C1C2, C1C3,
C4C2, C1C2C3 ... work with the ChineseTokenizer but
not with the CJKTokenizer.
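The two parsing rules described above can be sketched in plain Java. This is only an illustration and not the Lucene implementation: the class TokenizationSketch and its methods unigrams, bigrams and allIndexed are hypothetical names, the strings "C1" to "C4" stand in for single Chinese characters, and "matching" is reduced to checking that every query token is present in the set of indexed tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; it does not use or reproduce Lucene code.
class TokenizationSketch {

    // One token per character, the rule described for ChineseTokenizer.
    static List<String> unigrams(List<String> chars) {
        return new ArrayList<>(chars);
    }

    // Overlapping two-character tokens, the rule described for CJKTokenizer.
    static List<String> bigrams(List<String> chars) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    // Simplified stand-in for a search: true if every query token was indexed.
    static boolean allIndexed(List<String> indexTokens, List<String> queryTokens) {
        return indexTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("C1", "C2", "C3", "C4");
        List<String> query = Arrays.asList("C1", "C3"); // the query "C1C3"

        System.out.println(unigrams(text)); // [C1, C2, C3, C4]
        System.out.println(bigrams(text));  // [C1C2, C2C3, C3C4]

        // Single-character tokens of the query are all in the index.
        System.out.println(allIndexed(unigrams(text), unigrams(query))); // true
        // The bigram C1C3 never occurs in the indexed text.
        System.out.println(allIndexed(bigrams(text), bigrams(query)));   // false
    }
}
```

In this sketch the query C1C3 is found under the single-character rule because C1 and C3 are each indexed, while under the bigram rule the only query token, C1C3, is not among C1C2, C2C3, C3C4.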
- Version:
- 1.0
- Author:
- Yiyi Sun
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Method Summary
Token next()
Returns the next token in the stream, or null at EOS.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ChineseTokenizer
public ChineseTokenizer(Reader in)
next
public final Token next()
throws IOException
- Description copied from class: TokenStream
- Returns the next token in the stream, or null at EOS.
The returned Token is a "full private copy" (not
re-used across calls to next()), but this method is
slower than calling TokenStream.next(Token) instead.
- Overrides:
next in class TokenStream
- Throws:
IOException
Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.