public abstract class BreakIterator

extends Object
implements Cloneable

java.lang.Object
↳	java.text.BreakIterator

Class Overview

Locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. We provide four built-in types of BreakIterator:

getSentenceInstance() returns a BreakIterator that locates boundaries between sentences. This is useful for triple-click selection, for example.
getWordInstance() returns a BreakIterator that locates boundaries between words. This is useful for double-click selection or "find whole words" searches. This type of BreakIterator makes sure there is a boundary position at the beginning and end of each legal word (numbers count as words, too). Whitespace and punctuation are kept separate from real words.
getLineInstance() returns a BreakIterator that locates positions where it is legal for a text editor to wrap lines. This is similar to word breaking, but not the same: punctuation and whitespace are generally kept with words (you don't want a line to start with whitespace, for example), and some special characters can force a position to be considered a line break position or prevent a position from being a line break position.
getCharacterInstance() returns a BreakIterator that locates boundaries between logical characters. Because of the structure of the Unicode encoding, a logical character may be stored internally as more than one Unicode code point. (A with an umlaut may be stored as an a followed by a separate combining umlaut character, for example, but the user still thinks of it as one character.) This iterator allows various processes (especially text editors) to treat as characters the units of text that a user would think of as characters, rather than the units of text that the computer sees as "characters".
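
As a rough illustration of how the four factory methods differ, the sketch below splits a string at every boundary each iterator reports. The class name BoundaryKinds and the helper segments are made up for this example, and Locale.US is an arbitrary choice; exact boundaries can vary with the locale and with the runtime's break rules.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryKinds {
    // Splits text at every boundary reported by the given iterator.
    static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        // Word breaking reports the whitespace as its own segment.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), text));
        // Line breaking keeps the trailing whitespace with the preceding word.
        System.out.println(segments(BreakIterator.getLineInstance(Locale.US), text));
        // "a" followed by U+0308 (combining diaeresis) is one logical character.
        System.out.println(segments(BreakIterator.getCharacterInstance(Locale.US), "a\u0308b").size());
    }
}
```

The same loop works for a sentence iterator obtained from getSentenceInstance(); only the factory call changes.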

BreakIterator's interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like first(), last(), next(), and previous() that update the current position. All BreakIterators uphold the following invariants:

The beginning and end of the text are always treated as boundary positions.
The current position of the iterator is always a boundary position (random-access methods move the iterator to the nearest boundary position before or after the specified position, not to the specified position).
DONE is used as a flag to indicate when iteration has stopped. DONE is only returned when the current position is the end of the text and the user calls next(), or when the current position is the beginning of the text and the user calls previous().
Break positions are numbered by the positions of the characters that follow them. Thus, under normal circumstances, the position before the first character is 0, the position after the first character is 1, and the position after the last character is 1 plus the index of the last character, which equals the length of the string.
The client can change the position of an iterator, or the text it analyzes, at will, but cannot change its behavior. A client that wants different behavior must instantiate a new iterator.
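
The invariants above can be checked with a few calls. The following is a minimal sketch, assuming a word iterator for Locale.US; the class name IteratorInvariants and the method probe are hypothetical names used only here.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

final class IteratorInvariants {
    // Returns {first(), last(), next() at the end, previous() at the start}.
    static int[] probe(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int first = it.first();          // the beginning of the text is a boundary
        int beforeFirst = it.previous(); // nothing precedes it, so DONE
        int last = it.last();            // the end of the text is a boundary
        int afterLast = it.next();       // nothing follows it, so DONE
        return new int[] {first, last, afterLast, beforeFirst};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("Hello world")));
    }
}
```

For an 11-character string, the first two values are 0 and 11, and the last two are both BreakIterator.DONE.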

BreakIterator accesses the text it analyzes through a CharacterIterator, which makes it possible to use BreakIterator to analyze text in any text-storage vehicle that provides a CharacterIterator interface.
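
One consequence is that a sub-range of a larger string can be analyzed without copying it. The sketch below is an illustration under assumptions: it uses java.text.StringCharacterIterator restricted to a range, and the class RangeBoundaries and method wordBoundaries are invented for the example. With the standard java.text implementation, the reported offsets are expected to be indices into the full string rather than into the range.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RangeBoundaries {
    // Collects the word boundaries inside text[begin, end) without creating a substring.
    static List<Integer> wordBoundaries(String text, int begin, int end) {
        CharacterIterator range = new StringCharacterIterator(text, begin, end, begin);
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(range);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b); // offsets index into the full string, not the range
        }
        return out;
    }

    public static void main(String[] args) {
        // Analyze only "Hello world" inside the longer string.
        System.out.println(wordBoundaries("xx Hello world yy", 3, 14));
    }
}
```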

Note: Some types of BreakIterator can take a long time to create, and instances of BreakIterator are not currently cached by the system. For optimal performance, keep instances of BreakIterator around as long as it makes sense. For example, when word-wrapping a document, don't create and destroy a new BreakIterator for each line. Create one break iterator for the whole document (or whatever stretch of text you're wrapping) and use it to do the whole job of wrapping the text.
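
A minimal sketch of the reuse pattern described in the note: one iterator is created up front and retargeted with setText(String) for each line. The class ReusedIterator, the method countWords, and the "segment starts with a letter or digit" test are assumptions made for this example, not part of the API.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ReusedIterator {
    // Counts word-like segments across many lines using a single iterator.
    static int countWords(List<String> lines) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US); // created once
        int count = 0;
        for (String line : lines) {
            words.setText(line); // retargets the same iterator at the new text
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                // Count only segments that begin with a letter or digit.
                if (Character.isLetterOrDigit(line.codePointAt(start))) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("Hello world", "one two three")));
    }
}
```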

Examples:

Creating and using text boundaries:

 public static void main(String args[]) {
     if (args.length == 1) {
         String stringToExamine = args[0];
         //print each word in order
         BreakIterator boundary = BreakIterator.getWordInstance();
         boundary.setText(stringToExamine);
         printEachForward(boundary, stringToExamine);
         //print each sentence in reverse order
         boundary = BreakIterator.getSentenceInstance(Locale.US);
         boundary.setText(stringToExamine);
         printEachBackward(boundary, stringToExamine);
         printFirst(boundary, stringToExamine);
         printLast(boundary, stringToExamine);
     }
 }

Print each element in order:

 public static void printEachForward(BreakIterator boundary, String source) {
     int start = boundary.first();
     for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
         System.out.println(source.substring(start, end));
     }
 }

Print each element in reverse order:

 public static void printEachBackward(BreakIterator boundary, String source) {
     int end = boundary.last();
     for (int start = boundary.previous(); start != BreakIterator.DONE; end = start, start = boundary.previous()) {
         System.out.println(source.substring(start, end));
     }
 }

Print the first element:

 public static void printFirst(BreakIterator boundary, String source) {
     int start = boundary.first();
     int end = boundary.next();
     System.out.println(source.substring(start, end));
 }

Print the last element:

 public static void printLast(BreakIterator boundary, String source) {
     int end = boundary.last();
     int start = boundary.previous();
     System.out.println(source.substring(start, end));
 }

Print the element at a specified position:

 public static void printAt(BreakIterator boundary, int pos, String source) {
     int end = boundary.following(pos);
     int start = boundary.previous();
     System.out.println(source.substring(start, end));
 }
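
Because following(int) reports the first boundary strictly after the given offset, an offset at or past the end of the text has no following boundary. A slightly more defensive variant might look like the sketch below; the class ElementAt, the method elementAt, and the choice to return null for out-of-range offsets are assumptions of this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class ElementAt {
    // Returns the segment containing index pos, or null when pos is outside the text.
    static String elementAt(BreakIterator boundary, String source, int pos) {
        if (pos < 0 || pos >= source.length()) {
            return null; // no boundary follows such an offset
        }
        boundary.setText(source);
        int end = boundary.following(pos); // first boundary strictly after pos
        int start = boundary.previous();   // boundary at or before pos
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        System.out.println(elementAt(words, "Hello world", 7));
    }
}
```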

Find the next word:

 public static int nextWordStartAfter(int pos, String text) {
     BreakIterator wb = BreakIterator.getWordInstance();
     wb.setText(text);
     int last = wb.following(pos);
     int current = wb.next();
     while (current != BreakIterator.DONE) {
         for (int p = last; p < current; p++) {
             if (Character.isLetter(text.charAt(p)))
                 return last;
         }
         last = current;
         current = wb.next();
     }
     return BreakIterator.DONE;
 }

The iterator returned by BreakIterator.getWordInstance() is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses a simple heuristic to determine which boundary is the beginning of a word: If the characters between this boundary and the next boundary include at least one letter (this can be an alphabetical letter, a CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text between this boundary and the next is a word; otherwise, it's the material between words.
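
The same heuristic can be used to collect only the words from a string. The sketch below is a variation rather than the documented method: it accepts digits as well as letters (since numbers count as words), and the names WordsOnly, words, and containsLetterOrDigit are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordsOnly {
    // Returns the segments between word boundaries that contain a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    // Scans by code point so that supplementary characters are tested whole.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42", Locale.US));
    }
}
```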

Summary

Constants
int	DONE	This constant is returned by iterate methods like `previous()` or `next()` if they have returned all valid boundaries.

Protected Constructors
	BreakIterator() Default constructor, just for invocation by a subclass.

Public Methods
Object	clone() Creates a copy of this iterator. All status information, including the current position, is kept the same.
abstract int	current() Returns this iterator's current position.
abstract int	first() Sets this iterator's current position to the first boundary and returns that position.
abstract int	following(int offset) Sets this iterator's current position to the first boundary following the given offset and returns that position.
static Locale[]	getAvailableLocales() Returns all supported locales in an array.
static BreakIterator	getCharacterInstance() Returns a new instance of `BreakIterator` to iterate over characters using the default locale.
static BreakIterator	getCharacterInstance(Locale where) Returns a new instance of `BreakIterator` to iterate over characters using the given locale.
static BreakIterator	getLineInstance() Returns a new instance of `BreakIterator` to iterate over line breaks using the default locale.
static BreakIterator	getLineInstance(Locale where) Returns a new instance of `BreakIterator` to iterate over line breaks using the given locale.
static BreakIterator	getSentenceInstance(Locale where) Returns a new instance of `BreakIterator` to iterate over sentence-breaks using the given locale.
static BreakIterator	getSentenceInstance() Returns a new instance of `BreakIterator` to iterate over sentence-breaks using the default locale.
abstract CharacterIterator	getText() Returns a `CharacterIterator` which represents the text being analyzed.
static BreakIterator	getWordInstance() Returns a new instance of `BreakIterator` to iterate over word-breaks using the default locale.
static BreakIterator	getWordInstance(Locale where) Returns a new instance of `BreakIterator` to iterate over word-breaks using the given locale.
boolean	isBoundary(int offset) Indicates whether the given offset is a boundary position.
abstract int	last() Sets this iterator's current position to the last boundary and returns that position.
abstract int	next() Sets this iterator's current position to the next boundary after the current position, and returns this position.
abstract int	next(int n) Moves this iterator's current position by n boundaries from the current position (backward if n is negative) and returns the new position.
int	preceding(int offset) Sets this iterator's current position to the last boundary preceding the given offset and returns that position, or `DONE` if the given offset is the starting position.
abstract int	previous() Sets this iterator's current position to the previous boundary before the current position and returns that position.
void	setText(String newText) Sets the new text string to be analyzed. The current position is reset to the beginning of the new string, and the old string is discarded.
abstract void	setText(CharacterIterator newText) Sets the new text to be analyzed by the given `CharacterIterator`.

Protected Methods
static int	getInt(byte[] buf, int offset) Gets an int value from the given byte array, starting from the given offset.
static long	getLong(byte[] buf, int offset) Gets a long value from the given byte array, starting from the given offset.
static short	getShort(byte[] buf, int offset) Gets a short value from the given byte array, starting from the given offset.

Inherited Methods

From class java.lang.Object

Object	clone() Creates and returns a copy of this `Object`.
boolean	equals(Object o) Compares this instance with the specified object and indicates if they are equal.
void	finalize() Is called before the object's memory is being reclaimed by the VM.
final Class<? extends Object>	getClass() Returns the unique instance of Class which represents this object's class.
int	hashCode() Returns an integer hash code for this object.
final void	notify() Causes a thread which is waiting on this object's monitor (by means of calling one of the `wait()` methods) to be woken up.
final void	notifyAll() Causes all threads which are waiting on this object's monitor (by means of calling one of the `wait()` methods) to be woken up.
String	toString() Returns a string containing a concise, human-readable description of this object.
final void	wait(long millis, int nanos) Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object or until the specified timeout expires.
final void	wait(long millis) Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object or until the specified timeout expires.
final void	wait() Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object.

Constants

public static final int DONE

This constant is returned by iterate methods like previous() or next() if they have returned all valid boundaries.

Constant Value: -1 (0xffffffff)

Protected Constructors

protected BreakIterator ()

Default constructor, just for invocation by a subclass.

Public Methods

public Object clone ()

Creates a copy of this iterator. All status information, including the current position, is kept the same.

Returns

a copy of this iterator.
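
A small sketch of what "kept the same" means in practice: the copy starts at the original's position and can then be moved independently. This reflects behavior observed with the standard java.text implementation; the class CloneDemo and the method advanceCloneOnly are names invented for the example.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

final class CloneDemo {
    // Returns {original position, copy position} after advancing only the copy.
    static int[] advanceCloneOnly(String text) {
        BreakIterator original = BreakIterator.getWordInstance(Locale.US);
        original.setText(text);
        original.first();
        original.next();                                   // move to the second boundary
        BreakIterator copy = (BreakIterator) original.clone(); // starts at the same position
        copy.next();                                       // moves the copy only
        return new int[] {original.current(), copy.current()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(advanceCloneOnly("Hello world")));
    }
}
```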

public abstract int current ()

Returns this iterator's current position.

Returns

this iterator's current position.

public abstract int first ()

Sets this iterator's current position to the first boundary and returns that position.

Returns

the position of the first boundary.

public abstract int following (int offset)

Sets this iterator's current position to the first boundary following the given offset and returns that position. Returns DONE if there is no boundary after the given offset.

Parameters

offset	the given position to be searched for.

Returns

the position of the first boundary following the given offset.

public static Locale[] getAvailableLocales ()

Returns all supported locales in an array.

Returns

all supported locales.

public static BreakIterator getCharacterInstance ()

Returns a new instance of BreakIterator to iterate over characters using the default locale.

Returns

a new instance of BreakIterator using the default locale.

public static BreakIterator getCharacterInstance (Locale where)

Returns a new instance of BreakIterator to iterate over characters using the given locale.

Parameters

where	the given locale.

Returns

a new instance of BreakIterator using the given locale.

public static BreakIterator getLineInstance ()

Returns a new instance of BreakIterator to iterate over line breaks using the default locale.

Returns

a new instance of BreakIterator using the default locale.

public static BreakIterator getLineInstance (Locale where)

Returns a new instance of BreakIterator to iterate over line breaks using the given locale.

Parameters

where	the given locale.

Returns

a new instance of BreakIterator using the given locale.

Throws

NullPointerException	if `where` is `null`.

public static BreakIterator getSentenceInstance (Locale where)

Returns a new instance of BreakIterator to iterate over sentence-breaks using the given locale.

Parameters

where	the given locale.

Returns

a new instance of BreakIterator using the given locale.

Throws

NullPointerException	if `where` is `null`.

public static BreakIterator getSentenceInstance ()

Returns a new instance of BreakIterator to iterate over sentence-breaks using the default locale.

Returns

a new instance of BreakIterator using the default locale.

public abstract CharacterIterator getText ()

Returns a CharacterIterator which represents the text being analyzed. Please note that the returned value is probably the internal iterator used by this object. If the invoker wants to modify the status of the returned iterator, it is recommended to first create a clone of the iterator returned.

Returns

a CharacterIterator which represents the text being analyzed.
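
The recommendation to clone can be followed as in this sketch, which reads the analyzed text back out without touching the iterator's own state. The class TextSnapshot and the method readText are hypothetical; the loop stops at CharacterIterator.DONE, so text that itself contains U+FFFF would be cut short.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class TextSnapshot {
    // Reads the analyzed text through a clone so the break iterator is not disturbed.
    static String readText(BreakIterator it) {
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");
        words.next();
        System.out.println(readText(words));
        System.out.println(words.current()); // unchanged by readText
    }
}
```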

public static BreakIterator getWordInstance ()

Returns a new instance of BreakIterator to iterate over word-breaks using the default locale.

Returns

a new instance of BreakIterator using the default locale.

public static BreakIterator getWordInstance (Locale where)

Returns a new instance of BreakIterator to iterate over word-breaks using the given locale.

Parameters

where	the given locale.

Returns

a new instance of BreakIterator using the given locale.

Throws

NullPointerException	if `where` is `null`.

public boolean isBoundary (int offset)

Indicates whether the given offset is a boundary position. If this method returns true, the current iteration position is set to the given position; if the function returns false, the current iteration position is set as though following(int) had been called.

Parameters

offset	the given offset to check.

Returns

true if the given offset is a boundary position; false otherwise.
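
The side effect on the current position is easy to overlook. The sketch below, using hypothetical names BoundaryCheck, isWordBoundary, and positionAfterCheck, shows both the result and where the iterator ends up. Offsets are assumed to lie strictly inside the text; offset 0 and out-of-range offsets are deliberately left out because implementations may treat them specially.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryCheck {
    // Reports whether offset is a word boundary in text.
    static boolean isWordBoundary(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it.isBoundary(offset);
    }

    // Returns the iterator's current position after calling isBoundary(offset).
    static int positionAfterCheck(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.isBoundary(offset); // moves the iterator as a side effect
        return it.current();
    }

    public static void main(String[] args) {
        System.out.println(isWordBoundary("Hello world", 5));     // end of "Hello"
        System.out.println(positionAfterCheck("Hello world", 2)); // inside "Hello"
    }
}
```

Code that needs to keep its place while testing an offset can call isBoundary on a clone of the iterator instead.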

public abstract int last ()

Sets this iterator's current position to the last boundary and returns that position.

Returns

the position of the last boundary.

public abstract int next ()

Sets this iterator's current position to the next boundary after the current position, and returns this position. Returns DONE if no boundary was found after the current position.

Returns

the position of the next boundary, or DONE if there is none.

public abstract int next (int n)

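Moves this iterator's current position by n boundaries relative to the current position (forward if n is positive, backward if n is negative) and returns the new position. Returns DONE if the first or last boundary is reached before n boundaries have been traversed.

Parameters

n	the number of boundaries to move; negative values move backward, and 0 leaves the position unchanged.

Returns

the position of the nth boundary from the current position, or DONE if there is no such boundary.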
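
In the standard java.text implementation, the argument to next(int) is a count of boundaries relative to the current position, not a text offset. A short sketch, with the class RelativeMove and method hop invented for illustration:

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

final class RelativeMove {
    // Returns the results of next(2), next(-1), and next(100), starting from first().
    static int[] hop(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.first();
        int forwardTwo = it.next(2); // passes one boundary and lands on the second
        int backOne = it.next(-1);   // a negative count moves backward
        int tooFar = it.next(100);   // runs out of boundaries, so DONE
        return new int[] {forwardTwo, backOne, tooFar};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(hop("Hello world")));
    }
}
```

To jump to the boundary nearest a text offset, use following(int) or preceding(int) instead.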

public int preceding (int offset)

Sets this iterator's current position to the last boundary preceding the given offset and returns that position. Returns DONE if the given offset is the starting position of the text.

Parameters

offset	the offset before which to search for a boundary.

Returns

the position of the last boundary preceding the given offset.
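
Together with following(int), this method is a natural fit for caret movement over logical characters. The sketch below assumes a character iterator whose text has already been set; the class CaretMoves and the methods left and right are hypothetical, and clamping to the ends of the text instead of returning DONE is a design choice of this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CaretMoves {
    // Moves the caret one logical character to the left, stopping at the start.
    static int left(BreakIterator chars, int caret) {
        int begin = chars.getText().getBeginIndex();
        if (caret <= begin) {
            return begin;
        }
        int p = chars.preceding(caret);
        return p == BreakIterator.DONE ? begin : p;
    }

    // Moves the caret one logical character to the right, stopping at the end.
    static int right(BreakIterator chars, int caret) {
        int end = chars.getText().getEndIndex();
        if (caret >= end) {
            return end;
        }
        int p = chars.following(caret);
        return p == BreakIterator.DONE ? end : p;
    }

    public static void main(String[] args) {
        BreakIterator chars = BreakIterator.getCharacterInstance(Locale.US);
        chars.setText("a\u0308b"); // "a" + combining diaeresis, then "b"
        System.out.println(right(chars, 0)); // steps over the base letter and its mark together
        System.out.println(left(chars, 3));
    }
}
```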

public abstract int previous ()

Sets this iterator's current position to the previous boundary before the current position and returns that position. Returns DONE if no boundary was found before the current position.

Returns

the position of the previous boundary, or DONE if there is none.

public void setText (String newText)

Sets the new text string to be analyzed. The current position is reset to the beginning of the new string, and the old string is discarded.

Parameters

newText	the new text string to be analyzed.

public abstract void setText (CharacterIterator newText)

Sets the new text to be analyzed by the given CharacterIterator. The position will be reset to the beginning of the new text, and other status information of this iterator will be kept.

Parameters

newText	the `CharacterIterator` referring to the text to be analyzed.

Protected Methods

protected static int getInt (byte[] buf, int offset)

Gets an int value from the given byte array, starting from the given offset.

Parameters

buf	the bytes to be converted.
offset	the start position of the conversion.

Returns

the converted int value.

Throws

NullPointerException	if `buf` is `null`.
ArrayIndexOutOfBoundsException	if `offset < 0` or `offset + INT_LENGTH` is greater than the length of `buf`.

protected static long getLong (byte[] buf, int offset)

Gets a long value from the given byte array, starting from the given offset.

Parameters

buf	the bytes to be converted.
offset	the start position of the conversion.

Returns

the converted long value.

Throws

NullPointerException	if `buf` is `null`.
ArrayIndexOutOfBoundsException	if `offset < 0` or `offset + LONG_LENGTH` is greater than the length of `buf`.

protected static short getShort (byte[] buf, int offset)

Gets a short value from the given byte array, starting from the given offset.

Parameters

buf	the bytes to be converted.
offset	the start position of the conversion.

Returns

the converted short value.

Throws

NullPointerException	if `buf` is `null`.
ArrayIndexOutOfBoundsException	if `offset < 0` or `offset + SHORT_LENGTH` is greater than the length of `buf`.
