java.lang.Object
  org.apache.lucene.analysis.Token
public class Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc.
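As a rough illustration of how offsets tie a token back to its source text, the sketch below marks a term inside the original string. The class `OffsetHighlightSketch` and its `highlight` method are hypothetical helpers written for this example and are not part of Lucene; the only assumption taken from this page is that the end offset is one greater than the position of the term's last character.

```java
// Hypothetical helper, not part of Lucene: wraps the span [startOffset, endOffset)
// of the source text in brackets, the way a highlighter might use token offsets.
final class OffsetHighlightSketch {
    static String highlight(String source, int startOffset, int endOffset) {
        // endOffset is exclusive: one greater than the last character's position.
        return source.substring(0, startOffset)
                + "[" + source.substring(startOffset, endOffset) + "]"
                + source.substring(endOffset);
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        // The term "quick" occupies characters 4..8, so its offsets are (4, 9).
        System.out.println(highlight(text, 4, 9));
    }
}
```

Running `main` prints `the [quick] brown fox`. Because the offsets refer to the original text, this still works when a filter has altered the term text itself, for example by lowercasing or stemming.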
The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable-length byte array. Use TermPositions.getPayloadLength() and TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
WARNING: The status of the Payloads feature is experimental. The APIs introduced here might change in the future, in which case they will no longer be supported.
NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of new'ing a Token and String for every term. The APIs that accept String termText are still available but a warning about the associated performance cost has been added (below). The termText() method has been deprecated.
Tokenizers and filters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.next(Token) API.
Failing that, to create a new Token you should first use one of the constructors that starts with null text. Then you should call either termBuffer() or resizeTermBuffer(int) to retrieve the Token's termBuffer. Fill in the characters of your term into this buffer, and finally call setTermLength(int) to set the length of the term text. See LUCENE-969 for details.
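The buffer re-use pattern described above can be sketched without Lucene on the classpath. The class `ReusableTermSketch` below is a stand-in written for this example: it mirrors the method names termBuffer(), resizeTermBuffer(int), setTermLength(int) and termLength() from this page, but its initial capacity, its growth policy, and the `fill` and `text` helpers are assumptions and not Lucene's implementation.

```java
// Stand-in for illustration only. It mirrors the method names on this page
// but is NOT Lucene's Token; capacity and growth policy are assumptions.
final class ReusableTermSketch {
    private char[] termBuffer = new char[4];
    private int termLength;

    char[] termBuffer() { return termBuffer; }

    // Grows the buffer to at least newSize, keeping the existing content.
    char[] resizeTermBuffer(int newSize) {
        if (termBuffer.length < newSize) {
            char[] grown = new char[Math.max(newSize, termBuffer.length * 2)];
            System.arraycopy(termBuffer, 0, grown, 0, termBuffer.length);
            termBuffer = grown;
        }
        return termBuffer;
    }

    void setTermLength(int length) { termLength = length; }

    int termLength() { return termLength; }

    // Hypothetical helper: writes a term into the shared instance in place.
    static void fill(ReusableTermSketch token, CharSequence term) {
        char[] buf = token.resizeTermBuffer(term.length());
        for (int i = 0; i < term.length(); i++) {
            buf[i] = term.charAt(i);
        }
        // Only the first termLength characters of the buffer are valid.
        token.setTermLength(term.length());
    }

    // Hypothetical helper: copies the valid characters out for display.
    String text() { return new String(termBuffer, 0, termLength); }

    public static void main(String[] args) {
        ReusableTermSketch token = new ReusableTermSketch(); // one instance for all terms
        for (String term : new String[] {"lucene", "is", "fast"}) {
            fill(token, term);
            System.out.println(token.text() + " (" + token.termLength() + ")");
        }
    }
}
```

The point of the design is that the loop allocates one token and at most a few buffers regardless of how many terms pass through it. Stale characters beyond termLength() may remain in the buffer, which is why the length must be set after every fill.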
See Also: Payload
Field Summary | |
---|---|
static String | DEFAULT_TYPE |
Constructor Summary | |
---|---|
Token() | Constructs a Token with null text. |
Token(int start, int end) | Constructs a Token with null text and start & end offsets. |
Token(int start, int end, String typ) | Constructs a Token with null text and start & end offsets plus the Token type. |
Token(String text, int start, int end) | Constructs a Token with the given term text, and start & end offsets. |
Token(String text, int start, int end, String typ) | Constructs a Token with the given text, start and end offsets, & type. |
Method Summary | |
---|---|
void | clear(): Resets the term text, payload, and positionIncrement to default. |
Object | clone() |
int | endOffset(): Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text. |
Payload | getPayload(): Returns this Token's payload. |
int | getPositionIncrement(): Returns the position increment of this Token. |
char[] | resizeTermBuffer(int newSize): Grows the termBuffer to at least size newSize. |
void | setEndOffset(int offset): Set the ending offset. |
void | setPayload(Payload payload): Sets this Token's payload. |
void | setPositionIncrement(int positionIncrement): Set the position increment. |
void | setStartOffset(int offset): Set the starting offset. |
void | setTermBuffer(char[] buffer, int offset, int length): Copies the contents of buffer, starting at offset for length characters, into the termBuffer array. |
void | setTermLength(int length): Set number of valid characters (length of the term) in the termBuffer array. |
void | setTermText(String text): Sets the Token's term text. |
void | setType(String type): Set the lexical type. |
int | startOffset(): Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. |
char[] | termBuffer(): Returns the internal termBuffer character array which you can then directly alter. |
int | termLength(): Return number of valid characters (length of the term) in the termBuffer array. |
String | termText(): Deprecated. Use termBuffer() and termLength() instead. |
String | toString() |
String | type(): Returns this Token's lexical type. |
Methods inherited from class java.lang.Object |
---|
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_TYPE
Constructor Detail |
---|
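public Token()
Constructs a Token with null text.
public Token(int start, int end)
Constructs a Token with null text and start & end offsets.
Parameters:
start - start offset
end - end offset
public Token(int start, int end, String typ)
Constructs a Token with null text and start & end offsets plus the Token type.
Parameters:
start - start offset
end - end offset
public Token(String text, int start, int end)
Constructs a Token with the given term text, and start & end offsets.
Parameters:
text - term text
start - start offset
end - end offset
public Token(String text, int start, int end, String typ)
Constructs a Token with the given text, start and end offsets, & type.
Parameters:
text - term text
start - start offset
end - end offset
typ - token type
Method Detail |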
---|
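public void setPositionIncrement(int positionIncrement)
Set the position increment. This determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are:
- Set it to zero to put multiple terms in the same position, for example when a word has multiple stems, so that a phrase search matches on either stem.
- Set it to values greater than one to inhibit exact phrase matches, for example in a stop word filter that sets the increment to the number of stop words removed before each non-stop word, so that phrases do not match across the removed words.
See Also: TermPositions
public int getPositionIncrement()
Returns the position increment of this Token.
See Also: setPositionIncrement(int)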
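To make the effect of position increments concrete, the following sketch converts a sequence of increments into absolute token positions. `PositionIncrementSketch` and `positions` are hypothetical names for this example, and the sketch assumes that a first token with the default increment of one lands at position 0. An increment of zero stacks a token on the previous position, and an increment greater than one leaves a gap.

```java
// Hypothetical helper, not part of Lucene: turns per-token position increments
// into absolute positions, assuming the first default increment yields position 0.
final class PositionIncrementSketch {
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // so that the first increment of 1 gives position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Third token shares a position with the second (increment 0);
        // fourth token skips one position (increment 2).
        int[] increments = {1, 1, 0, 2};
        System.out.println(java.util.Arrays.toString(positions(increments)));
    }
}
```

Running `main` prints `[0, 1, 1, 3]`. The gap between positions 1 and 3 is what prevents an exact phrase query from matching across the skipped position.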
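public void setTermText(String text)
Sets the Token's term text. NOTE: for better indexing speed, use the char[] termBuffer methods to set the term text instead.
public final String termText()
Deprecated. Use termBuffer() and termLength() instead.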
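public final void setTermBuffer(char[] buffer, int offset, int length)
Copies the contents of buffer, starting at offset for length characters, into the termBuffer array. NOTE: for better indexing speed you should instead retrieve the termBuffer, using termBuffer() or resizeTermBuffer(int), and fill it in directly to set the term text. This saves an extra copy.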
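public final char[] termBuffer()
Returns the internal termBuffer character array which you can then directly alter. If the array is too small for your token, use resizeTermBuffer(int) to increase it. After altering the buffer be sure to call setTermLength(int) to record the number of valid characters that were placed into the termBuffer.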
public char[] resizeTermBuffer(int newSize)
Grows the termBuffer to at least size newSize.
Parameters:
newSize - minimum size of the new termBuffer
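public final int termLength()
Return number of valid characters (length of the term) in the termBuffer array.
public final void setTermLength(int length)
Set number of valid characters (length of the term) in the termBuffer array.
public final int startOffset()
Returns this Token's starting offset, the position of the first character corresponding to this token in the source text.
public void setStartOffset(int offset)
Set the starting offset.
See Also: startOffset()
public final int endOffset()
Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
public void setEndOffset(int offset)
Set the ending offset.
See Also: endOffset()
public final String type()
Returns this Token's lexical type.
public final void setType(String type)
Set the lexical type.
See Also: type()
public Payload getPayload()
Returns this Token's payload.
public void setPayload(Payload payload)
Sets this Token's payload.
public String toString()
Overrides: toString in class Object
public void clear()
Resets the term text, payload, and positionIncrement to default.
public Object clone()
Overrides: clone in class Object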