Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.
Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
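The range and literal-hyphen rules above can be sketched in plain Java. This is only an illustration under stated assumptions: parseBody is a hypothetical helper, not part of any library, and it handles just the text between '[' and ']' (single characters, ranges, and backslash escapes). It does not handle nested sets, properties, '^', or escaped range endpoints.

```java
import java.util.TreeSet;

class RangeSketch {
    // Hypothetical helper: parses only the text between '[' and ']'.
    static TreeSet<Character> parseBody(String body) {
        TreeSet<Character> set = new TreeSet<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\') {
                // A backslash makes the next character a literal.
                if (i + 1 >= body.length()) {
                    throw new IllegalArgumentException("dangling escape");
                }
                set.add(body.charAt(i + 1));
                i += 2;
            } else if (c == '-') {
                // A bare '-' is a literal only in first or last position.
                if (i != 0 && i != body.length() - 1) {
                    throw new IllegalArgumentException("misplaced '-' at " + i);
                }
                set.add('-');
                i++;
            } else if (i + 2 < body.length() && body.charAt(i + 1) == '-') {
                // char '-' char: the left end must be strictly smaller.
                char end = body.charAt(i + 2);
                if (c >= end) {
                    throw new IllegalArgumentException("bad range " + c + "-" + end);
                }
                for (int x = c; x <= end; x++) {
                    set.add((char) x);
                }
                i += 3;
            } else {
                set.add(c);
                i++;
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(parseBody("a-c"));    // [a, b, c]
        System.out.println(parseBody("a\\-b"));  // [-, a, b]
        System.out.println(parseBody("-ab"));    // [-, a, b]
        System.out.println(parseBody("ab-"));    // [-, a, b]
    }
}
```

The three hyphen spellings produce the same three-element set, while a reversed range such as "z-a" is rejected.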
Sets may be intersected using the '&' operator, or the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. The operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for difference; intersection is commutative.
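To see why left-to-right binding matters for difference, the following sketch models sets of code points with java.util.BitSet. It is an illustration only: the class and method names are hypothetical, Character.isLetter is used as a stand-in for [:L:] (its Unicode version may differ from the one behind the pattern syntax), and only code points below 0x2000 are considered.

```java
import java.util.BitSet;

class SetOpsSketch {
    // Every code point in the inclusive range [lo, hi].
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet();
        s.set(lo, hi + 1);
        return s;
    }

    // Stand-in for [:L:], limited to code points below `limit`.
    static BitSet letters(int limit) {
        BitSet s = new BitSet();
        for (int cp = 0; cp < limit; cp++) {
            if (Character.isLetter(cp)) {
                s.set(cp);
            }
        }
        return s;
    }

    // a & b, leaving both inputs untouched.
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // a - b (asymmetric difference), leaving both inputs untouched.
    static BitSet minus(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    public static void main(String[] args) {
        BitSet letters = letters(0x2000);
        BitSet lower = range('a', 'z');
        BitSet block = range(0x0100, 0x01FF);

        // Letters with values less than 4096.
        BitSet below4096 = intersect(letters, range(0x0000, 0x0FFF));
        System.out.println(below4096.get('a'));      // true
        System.out.println(below4096.get(0x1000));   // false

        // Left-to-right grouping: ([:L:] - [a-z]) - [U+0100-U+01FF].
        BitSet leftToRight = minus(minus(letters, lower), block);
        // The other grouping: [:L:] - ([a-z] - [U+0100-U+01FF]).
        BitSet regrouped = minus(letters, minus(lower, block));
        System.out.println(leftToRight.get(0x0100)); // false
        System.out.println(regrouped.get(0x0100));   // true
    }
}
```

Under these assumptions the two groupings disagree on U+0100, which is why the binding direction is spelled out for difference.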
[a] | The set containing 'a' |
[a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] or \p{Lu} | The set of characters having the specified Unicode property; in this case, Unicode uppercase letters |
[:^Lu:] or \P{Lu} | The set of characters not having the given Unicode property |
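The "[^a-z]" row can be checked with a small sketch that complements a range over the full code point space, U+0000 through U+10FFFF. As before, this uses java.util.BitSet as a stand-in model; complementOfRange is a hypothetical name chosen for illustration.

```java
import java.util.BitSet;

class ComplementSketch {
    static final int LIMIT = 0x110000; // one past U+10FFFF

    // Models "[^lo-hi]": everything except the inclusive range [lo, hi].
    static BitSet complementOfRange(int lo, int hi) {
        BitSet s = new BitSet(LIMIT);
        s.set(lo, hi + 1);
        s.flip(0, LIMIT);
        return s;
    }

    public static void main(String[] args) {
        BitSet notLower = complementOfRange('a', 'z');
        int gapStart = notLower.nextClearBit(0);           // first excluded code point
        int secondRunStart = notLower.nextSetBit(gapStart); // first included one after the gap
        System.out.printf("U+%04X..U+%04X%n", 0, gapStart - 1);
        System.out.printf("U+%04X..U+%04X%n", secondRunStart, notLower.length() - 1);
    }
}
```

The two printed runs correspond to U+0000 through 'a'-1 and 'z'+1 through U+10FFFF.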
Warning: you cannot add an empty string ("") to a UnicodeSet.
Formal syntax
@author Alan Liu @stable ICU 2.0
pattern :=
('[' '^'? item* ']') | property
item :=
char | (char '-' char) | pattern-expr
pattern-expr :=
pattern | pattern-expr pattern | pattern-expr op pattern
op :=
'&' | '-'
special :=
'[' | ']' | '-'
char :=
any character that is not special
| ('\\' any character)
| ('\u' hex hex hex hex)
hex :=
any character for which Character.digit(c, 16)
returns a non-negative result
property :=
a Unicode property set pattern
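The hex rule in the grammar refers to Java's Character.digit(c, 16). A minimal sketch of how a '\u' escape with four hex digits might be decoded with it is shown below; decodeEscape and isHex are hypothetical helpers written for illustration, not the parser's actual code.

```java
class HexEscapeSketch {
    // True when c qualifies as `hex` under the grammar's rule.
    static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    // Hypothetical helper: expects a backslash, a 'u', and exactly four
    // hex digits starting at `pos`, and returns the decoded value.
    static int decodeEscape(String s, int pos) {
        if (pos + 6 > s.length() || s.charAt(pos) != '\\' || s.charAt(pos + 1) != 'u') {
            throw new IllegalArgumentException("no four-digit escape at " + pos);
        }
        int value = 0;
        for (int i = pos + 2; i < pos + 6; i++) {
            int d = Character.digit(s.charAt(i), 16);
            if (d < 0) {
                throw new IllegalArgumentException("not a hex digit at " + i);
            }
            value = value * 16 + d;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(isHex('f'));                  // true
        System.out.println(isHex('g'));                  // false
        System.out.println(decodeEscape("\\u0FFF", 0));  // 4095
    }
}
```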
Legend:
a := b
a may be replaced by b
a?
zero or one instance of a
a*
zero or more instances of a
a | b
either a or b
'a'
the literal string between the quotes