package

java.util.regex

Provides an implementation of regular expressions, which is useful for matching, searching, and replacing strings based on patterns. The two fundamental classes are Pattern and Matcher. The former takes a pattern described by means of a regular expression and compiles it into a special internal representation. The latter matches the compiled pattern against a given input.
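
As a minimal sketch of how the two classes cooperate, the following example compiles a pattern once and then applies it to several inputs. The class and method names (PatternMatcherSketch, findNumbers, isNumber) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternMatcherSketch {
    // Compile once; a Pattern is immutable and can be reused for many inputs.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /** Returns every run of decimal digits found in the input, in order. */
    static List<String> findNumbers(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(input);
        // find() advances to the next match; group() returns the matched text.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    /** Returns true only if the entire input is a run of decimal digits. */
    static boolean isNumber(String input) {
        return DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("25 green bottles, 3 walls")); // [25, 3]
        System.out.println(isNumber("1234")); // true
        System.out.println(isNumber("12a4")); // false
    }
}
```

Note the difference between find(), which searches for the pattern anywhere in the input, and matches(), which requires the whole input to match.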

Regular expressions

A regular expression consists of literal text, meta characters, character sets, and operators. The latter three have a special meaning when encountered during the processing of a pattern.
  • Meta characters are a special means to describe single characters in the input text. A common example of a meta character is the dot '.', which, when used in a regular expression, matches any character.
  • Character sets are a convenient means to describe different characters that match a single character in the input. Character sets are enclosed in square brackets '[' and ']' and use the dash '-' for forming ranges. A typical example is "[0-9a-fA-F]", which describes the set of all hexadecimal digits.
  • Operators modify or combine whole regular expressions, with the result being a regular expression again. An example of an operator is the asterisk '*', which, together with the regular expression preceding it, matches zero or more repetitions of that regular expression. The plus sign '+' is similar, but requires at least one occurrence.
Meta characters, the '[' and ']' that form a character set, and operators normally lose their special meaning when preceded by a backslash '\'. To get a backslash by itself, use a double backslash. Note that when using regular expressions in Java source code, some care has to be taken to get the backslashes right (due to yet another level of escaping being necessary for Java).
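
The two levels of escaping mentioned above can be illustrated with a short sketch. Each backslash that the regular expression needs has to be doubled inside a Java string literal. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapingSketch {
    /** The regex \. (a literal dot) is written "\\." in Java source. */
    static boolean isLiteralDot(String s) {
        return Pattern.matches("\\.", s);
    }

    /** The regex \\ (a single literal backslash) is written "\\\\" in Java source. */
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    /** The character set from the text above, repeated one or more times. */
    static boolean isHex(String s) {
        return Pattern.matches("[0-9a-fA-F]+", s);
    }

    public static void main(String[] args) {
        System.out.println(isLiteralDot("."));        // true
        System.out.println(isLiteralDot("a"));        // false, the dot is no longer a meta character
        System.out.println(isSingleBackslash("\\"));  // true, the input is one backslash
        System.out.println(isHex("CafeBabe"));        // true
    }
}
```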

The following table gives some basic examples of regular expressions and input strings that match them:

Regular expression Matched string(s)
"Hello, World!" "Hello, World!"
"Hello, World." "Hello, World!", "Hello, World?"
"Hello, .*d!" "Hello, World!", "Hello, Android!", "Hello, Dad!"
"[0-9]+ green bottles" "0 green bottles", "25 green bottles", "1234 green bottles"

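The rows of the table above can be checked with a small sketch like the following. The helper matchesAll is a hypothetical name introduced for illustration; it requires each input to match the expression in full.

```java
import java.util.regex.Pattern;

final class ExampleTableSketch {
    /** Returns true if the regular expression matches every input in full. */
    static boolean matchesAll(String regex, String... inputs) {
        Pattern pattern = Pattern.compile(regex);
        for (String input : inputs) {
            if (!pattern.matcher(input).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The trailing dot is a meta character, so it accepts '!' as well as '?'.
        System.out.println(matchesAll("Hello, World.", "Hello, World!", "Hello, World?"));
        System.out.println(matchesAll("Hello, .*d!", "Hello, World!", "Hello, Android!", "Hello, Dad!"));
        System.out.println(matchesAll("[0-9]+ green bottles", "0 green bottles", "1234 green bottles"));
        // '+' requires at least one digit, so this prints false.
        System.out.println(matchesAll("[0-9]+ green bottles", " green bottles"));
    }
}
```
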
The following sections describe the various features in detail. There are also some implementation notes at the end.

Meta characters

The following table lists the meta characters understood in regular expressions.

Meta character Description
\a Match a BELL, \u0007.
\A Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a character set Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.
\b, within a character set Match a BACKSPACE, \u0008.
\B Match if the current position is not a word boundary.
\cX Match a control-X character (replace X with actual character).
\e Match an ESCAPE, \u001B.
\E Ends quoting started by \Q. Meta characters, character classes, and operators become active again.
\f Match a FORM FEED, \u000C.
\G Match if the current position is at the end of the previous match.
\n Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME} Match the named Unicode character.
\Q Quotes all following characters until \E. The following text is treated as literal.
\r Match a CARRIAGE RETURN, \u000D.
\t Match a HORIZONTAL TABULATION, \u0009.
\uhhhh Match the character with the hex value hhhh.
\Uhhhhhhhh Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\x{hhhh} Match the character with the hex value hhhh. From one to six hex digits may be supplied.
\xhh Match the character with the hex value hh.
\Z Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z Match if the current position is at the end of input.
\0n, \0nn, \0nnn Match the character with the octal value n, nn, or nnn. Maximum value is 0377.
\n Back reference, where n is a number rather than the letter 'n'. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions.
[character set] Match any one character from the character set. See character sets for a full description of what may appear between the square brackets.
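. Match any character. By default this excludes line terminators; the 's' flag removes that restriction.
^ Match at the beginning of a line. By default only the beginning of the input counts; with the 'm' flag, the beginning of every line does.
$ Match at the end of a line. By default only the end of the input counts; with the 'm' flag, the end of every line does.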
\ Quotes the following character, so that it loses any special meaning it might have.
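
A few of the meta characters from the table can be sketched as follows: \b for word boundaries, a numbered back reference, and \Q ... \E for quoting. The class and method names are hypothetical, and the patterns are illustrative rather than exhaustive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharacterSketch {
    /** Counts occurrences of "cat" as a whole word, using \b on both sides. */
    static int countWholeWordCat(String input) {
        Matcher m = Pattern.compile("\\bcat\\b").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Returns true if a word is immediately repeated, using back reference \1. */
    static boolean hasDoubledWord(String input) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(input).find();
    }

    /** Matches the text 1+1=2 literally by quoting it with \Q and \E. */
    static boolean isQuotedSum(String input) {
        return Pattern.matches("\\Q1+1=2\\E", input);
    }

    public static void main(String[] args) {
        // "concat" and "category" contain "cat" but not as a whole word.
        System.out.println(countWholeWordCat("cat concat cat's category")); // 2
        System.out.println(hasDoubledWord("this is is a test")); // true
        // Without quoting, '+' would be an operator and "11=2" would match instead.
        System.out.println(isQuotedSum("1+1=2")); // true
    }
}
```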

Character sets

The following table lists the syntax elements allowed inside a character set:

Element Description
[a] The character set consisting of the letter 'a' only.
[xyz] The character set consisting of the letters 'x', 'y', and 'z', described by explicit enumeration.
[x-z] The character set consisting of the letters 'x', 'y', and 'z', described by means of a range.
[^xyz] The character set consisting of everything but the letters 'x', 'y', and 'z'.
[[a-f][0-9]] The character set formed by building the union of the two character sets [a-f] and [0-9].
[[a-z]&&[jkl]] The character set formed by building the intersection of the two character sets [a-z] and [jkl]. You can also use a single '&', but this regular expression might not be portable.
[[a-z]--[jkl]] The character set formed by building the difference of the two character sets [a-z] and [jkl]. You can also use a single '-'. This operator is generally not portable.
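
The set syntax above can be tried out with a sketch like the following, which keeps only those characters of an input that belong to a given set. The helper name keep is hypothetical. The example assumes it may also run on implementations that do not understand the '--' operator, so it expresses the difference as an intersection with a negated set instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharacterSetSketch {
    /** Returns the characters of input that belong to the given character set, in order. */
    static String keep(String characterSet, String input) {
        Matcher m = Pattern.compile(characterSet).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[x-z]", "wxyz"));             // range: xyz
        System.out.println(keep("[^xyz]", "xaybz"));           // negation: ab
        System.out.println(keep("[[a-f][0-9]]", "a1g2z"));     // union: a12
        System.out.println(keep("[[a-z]&&[jkl]]", "hijklm"));  // intersection: jkl
        // Difference of [a-z] and [jkl], written as an intersection with [^jkl].
        System.out.println(keep("[a-z&&[^jkl]]", "hijklm"));   // him
    }
}
```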

Several frequently used character sets are predefined and named. They can be referenced by name, but otherwise behave like explicit character sets. The following table lists them:

Character set Description
\d, \D The set consisting of all digit characters (\d) or the opposite of it (\D).
\s, \S The set consisting of all space characters (\s) or the opposite of it (\S).
\w, \W The set consisting of all word characters (\w) or the opposite of it (\W).
\X The set of all grapheme clusters.
\p{NAME}, \P{NAME} The POSIX set with the specified NAME (\p{}) or the opposite of it (\P{}). Legal values for NAME are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', and 'XDigit'.
\p{inBLOCK}, \P{inBLOCK} The character set equivalent to the given Unicode BLOCK (\p{}) or the opposite of it (\P{}). An example for a legal BLOCK name is 'Hebrew', meaning, unsurprisingly, all Hebrew characters.
\p{CATEGORY}, \P{CATEGORY} The character set equivalent to the Unicode CATEGORY (\p{}) or the opposite of it (\P{}). An example for a legal CATEGORY name is 'Lu', meaning all uppercase letters.
\p{javaMETHOD}, \P{javaMETHOD} The character set equivalent to the isMETHOD() operation of the Character class (\p{}) or the opposite of it (\P{}).
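
The predefined sets from the table can be used wherever an explicit set could appear. The following sketch collects all matches of a few of them; the class and method names are hypothetical, and the inputs are restricted to ASCII so that the results do not depend on how an implementation treats non-ASCII digits or letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PredefinedSetSketch {
    /** Returns all non-overlapping matches of regex in input, in order. */
    static List<String> collect(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(collect("\\d+", "Room 101, floor 3"));        // [101, 3]
        System.out.println(collect("\\w+", "a_b-c"));                    // [a_b, c]
        System.out.println(collect("\\S+", "two  words"));               // [two, words]
        System.out.println(collect("\\p{Lu}", "Hello World"));           // [H, W]
        System.out.println(collect("\\p{Punct}", "Hi, there!"));         // [,, !]
        System.out.println(collect("\\p{javaLowerCase}+", "aBcd"));      // [a, cd]
    }
}
```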

Operators

The following table lists the operators understood inside regular expressions:

Operator Description
| Alternation. A|B matches either A or B.
* Match 0 or more times. Match as many times as possible.
+ Match 1 or more times. Match as many times as possible.
? Match zero or one times. Prefer one.
{n} Match exactly n times.
{n,} Match at least n times. Match as many times as possible.
{n,m} Match between n and m times. Match as many times as possible, but not more than m.
*? Match 0 or more times. Match as few times as possible.
+? Match 1 or more times. Match as few times as possible.
?? Match zero or one times. Prefer zero.
{n}? Match exactly n times.
{n,}? Match at least n times, but no more than required for an overall pattern match.
{n,m}? Match between n and m times. Match as few times as possible, but not less than n.
*+ Match 0 or more times. Match as many times as possible when first encountered, and do not retry with fewer even if the overall match fails. Possessive match.
++ Match 1 or more times. Possessive match.
?+ Match zero or one times. Possessive match.
{n}+ Match exactly n times.
{n,}+ Match at least n times. Possessive Match.
{n,m}+ Match between n and m times. Possessive Match.
( ... ) Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?: ... ) Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?> ... ) Atomic-match parentheses. The first match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, the search for a match backs up to a position before the "(?>".
(?# ... ) Free-format comment (?# comment ).
(?= ... ) Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... ) Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... ) Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<! ... ) Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx: ... ) Flag settings. Evaluate the parenthesized expression with the specified flags enabled or, for flags listed after the '-', disabled.
(?ismwx-ismwx) Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
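
The difference between the greedy, reluctant, and possessive forms, as well as the grouping and look-around operators, is easiest to see side by side. The following is a minimal sketch; firstMatch and parseRange are hypothetical helper names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OperatorSketch {
    /** Returns the first substring of input matched by regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Reads text such as "10-25" through two capturing groups; returns null if absent. */
    static int[] parseRange(String input) {
        Matcher m = Pattern.compile("(\\d+)-(\\d+)").matcher(input);
        if (!m.find()) {
            return null;
        }
        // group(1) and group(2) hold the text matched by the two parenthesized parts.
        return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));   // greedy: <b>bold</b>
        System.out.println(firstMatch("<.+?>", html));  // reluctant: <b>
        // Possessive: .*+ consumes the closing quote and never gives it back.
        System.out.println(firstMatch("\".*+\"", "\"abc\""));            // null
        System.out.println(firstMatch("cat|dog", "hotdog and cat"));     // dog
        System.out.println(firstMatch("(?:ab)+", "xababy"));             // abab
        System.out.println(firstMatch("\\d+(?= green)", "7 red, 25 green")); // 25
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));     // 42
        System.out.println(firstMatch("(?i)hello", "Say HELLO"));        // HELLO
        System.out.println(parseRange("pages 10-25")[1]);                // 25
    }
}
```

The possessive case returns no match because the engine is not allowed to retry with fewer repetitions, which is exactly what the greedy form does to succeed.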

Implementation notes

The regular expression implementation used in Android is provided by ICU. The regular expression notation is mostly a superset of the notation used in other Java language implementations. This means that existing applications will normally work as expected, but in rare cases some regular expression content that is meant to be literal might be interpreted with a special meaning. The most notable examples are the single '&', which can also be used as the intersection operator for character sets, and the difference operators '-' and '--'. Also, some of the flags are handled in a slightly different way:
  • The CASE_INSENSITIVE flag silently assumes Unicode case-insensitivity. That is, the UNICODE_CASE flag is effectively a no-op.
  • The CANON_EQ flag is not supported at all (throws an exception).
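
One way to sidestep the differences described above is to write patterns defensively. The following sketch shows three such habits, under the assumption that the same code should behave identically on this implementation and on others: quoting literal text with Pattern.quote, escaping '&' and '-' inside character sets, and always passing UNICODE_CASE together with CASE_INSENSITIVE. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PortabilitySketch {
    /** Searches for literal text; Pattern.quote keeps '&', '-', '[' and others literal. */
    static boolean containsLiteral(String haystack, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(haystack).find();
    }

    /** Escaping '&' inside a set avoids any reading of it as an intersection operator. */
    static boolean isAmpersandOrAB(String s) {
        return Pattern.matches("[a\\&b]", s);
    }

    /** Escaping '-' inside a set avoids any reading of it as a range or difference. */
    static boolean isDashOrAZ(String s) {
        return Pattern.matches("[a\\-z]", s);
    }

    /** Requests Unicode case folding explicitly instead of relying on a default. */
    static boolean matchesIgnoringCase(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("set [x--y] here", "[x--y]")); // true
        System.out.println(containsLiteral("abc", "a.c"));                // false
        System.out.println(isAmpersandOrAB("&"));                         // true
        System.out.println(isDashOrAZ("b"));                              // false
        // Lowercase and uppercase e with acute accent, given as string escapes.
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9"));      // true
    }
}
```

Because the CANON_EQ flag is rejected here, code that has to run on this implementation should not pass it.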