Users are also strongly encouraged to read the section on String Search and Collation in the user guide before attempting to use this class.
String searching becomes a little complicated when accents are encountered at match boundaries. If a match is found and it has preceding or trailing accents that are not part of the match, then the result returned will include the preceding accents up to the first base character if the pattern searched for starts with an accent. Likewise, if the pattern ends with an accent, all trailing accents up to the first base character will be included in the result.
For example, if a match is found in target text "a\u0325\u0300" for the pattern "a\u0325", the result returned by StringSearch will be the index 0 and length 3 <0, 3>. If a match is found in the target "a\u0325\u0300" for the pattern "\u0300", then the result will be index 1 and length 2 <1, 2>.
When decomposition mode is on for the RuleBasedCollator, every match that starts or ends with an accent will have its result extended to include the preceding or following accents, respectively. For example, if pattern "a" is looked for in the target text "á\u0325", the result will be index 0 and length 2 <0, 2>.
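The boundary rule described above can be pictured with a small, standard-library-only sketch. It is not the StringSearch implementation: the class name BoundaryAccentSketch and the method extend are hypothetical, the "core" match offsets are assumed to be already known, and a combining mark is approximated by the Unicode general categories Mn, Mc and Me rather than by collation ignorability.

```java
final class BoundaryAccentSketch {

    // Approximation (assumption): treat general categories Mn, Mc, Me as "accents".
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Widens a core match [start, end) and returns {index, length}.
    // Preceding accents are absorbed only if the match starts with an accent;
    // trailing accents are absorbed only if the match ends with an accent.
    static int[] extend(String text, int start, int end) {
        if (start < end && isCombiningMark(text.codePointAt(start))) {
            while (start > 0 && isCombiningMark(text.codePointBefore(start))) {
                start -= Character.charCount(text.codePointBefore(start));
            }
        }
        if (start < end && isCombiningMark(text.codePointBefore(end))) {
            while (end < text.length() && isCombiningMark(text.codePointAt(end))) {
                end += Character.charCount(text.codePointAt(end));
            }
        }
        return new int[] {start, end - start};
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";
        // Core match of the two-unit pattern (base letter plus ring below) at [0, 2).
        int[] first = extend(target, 0, 2);
        System.out.println("<" + first[0] + ", " + first[1] + ">");   // prints <0, 3>
        // Core match of the single grave accent at [2, 3).
        int[] second = extend(target, 2, 3);
        System.out.println("<" + second[0] + ", " + second[1] + ">"); // prints <1, 2>
    }
}
```

Both results agree with the <0, 3> and <1, 2> examples given above. The decomposition-mode case is not modeled here.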
The StringSearch class provides two options to handle accent matching described below:
Let S' be the sub-string of a text string S between the offsets start and end <start, end>.
A pattern string P matches a text string S at the offsets <start, length>
if
option 1. P matches some canonical equivalent string of S'. Suppose the RuleBasedCollator used for searching has a collation strength of TERTIARY, so that all accents are non-ignorable. If the pattern "a\u0300" is searched for in the target text "a\u0325\u0300", a match will be found, since the target text is canonically equivalent to "a\u0300\u0325".
option 2. P matches S', and if P starts or ends with a combining mark, there exists no non-ignorable combining mark before or after S' in S, respectively. Following the example above, the pattern "a\u0300" will not find a match in "a\u0325\u0300", since there is a non-ignorable accent '\u0325' between 'a' and '\u0300'. Even with a target text of "a\u0300\u0325", a match will not be found, because of the non-ignorable trailing accent \u0325.
Option 2 is the default mode for dealing with boundary accents; option 1 has to be selected explicitly via the API setCanonical(boolean).
One restriction is to be noted for option 1. Currently there are no composite characters that consist of a character with combining class > 0 before a character with combining class == 0. However, if such a character exists in the future, StringSearch may not work correctly with option 1 when such characters are encountered.
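The canonical equivalence that option 1 relies on can be checked with java.text.Normalizer from the standard library. The sketch below is only an illustration: the class name CanonicalEquivalenceSketch and its methods are hypothetical, and it compares raw code units instead of collation elements, so it does not reproduce how StringSearch actually matches.

```java
import java.text.Normalizer;

final class CanonicalEquivalenceSketch {

    // Two strings are canonically equivalent when their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        String nfdA = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nfdB = Normalizer.normalize(b, Normalizer.Form.NFD);
        return nfdA.equals(nfdB);
    }

    // Plain code-unit search, used here as a stand-in for an exact match (assumption).
    static int plainIndexOf(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";     // base letter, ring below, grave
        String reordered = "a\u0300\u0325";  // same marks in the other order
        String pattern = "a\u0300";

        // The two orderings are canonically equivalent.
        System.out.println(canonicallyEquivalent(target, reordered)); // prints true
        // The pattern is not a literal substring of the target as written...
        System.out.println(plainIndexOf(target, pattern));            // prints -1
        // ...but it is a literal prefix of the canonically equivalent string.
        System.out.println(plainIndexOf(reordered, pattern));         // prints 0
    }
}
```

This is why a search under option 1 can report a match for "a\u0300" in "a\u0325\u0300", while option 2 rejects it because of the extra non-ignorable accent at the boundary.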
SearchIterator provides APIs to specify the starting position within the text string to be searched, e.g. setIndex, preceding and following. Since the starting position will be set exactly as specified, please note that there are some dangerous positions at which the search may return incorrect results.
Though collator attributes will be taken into consideration while performing matches, there are no APIs provided in StringSearch for setting and getting the attributes. These attributes can be set by getting the collator from getCollator and using the APIs in com.ibm.icu.text.Collator. To update StringSearch to the new collator attributes, reset() or setCollator(RuleBasedCollator) has to be called.
Consult the String Search user guide and the SearchIterator
documentation for more information and examples of use.
This class is not subclassable
@see SearchIterator
@see RuleBasedCollator
@author Laura Werner, synwee
@stable ICU 2.0