Examples of com.ibm.icu.text.UnicodeSet

com.ibm.icu.text.UnicodeSet

See http://www.icu-project.org/userguide/unicodeSet.html. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.

Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or as the last character before the closing ']', it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters: 'a', 'b', and '-'.

Sets may be intersected using the '&' operator, and the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. The operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". The grouping only matters for difference; intersection is commutative and associative.
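To see why the left-to-right binding matters for difference, the sets can be modelled with plain JDK types. This is a minimal sketch, not ICU code: it assumes `java.util.BitSet` as a stand-in for UnicodeSet, and the names `DifferenceOrderSketch`, `range`, `minus`, `leftToRight`, and `rightFirst` are made up for illustration.

```java
import java.util.BitSet;

class DifferenceOrderSketch {
    // Hypothetical helper: a BitSet holding every code point in [lo, hi].
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet();
        s.set(lo, hi + 1); // the upper bound of BitSet.set is exclusive
        return s;
    }

    // Hypothetical helper: a - b, leaving both arguments unchanged.
    static BitSet minus(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    // Size of (a - b) - c, the grouping the pattern syntax uses.
    static int leftToRight(BitSet a, BitSet b, BitSet c) {
        return minus(minus(a, b), c).cardinality();
    }

    // Size of a - (b - c), the grouping the pattern syntax does not use.
    static int rightFirst(BitSet a, BitSet b, BitSet c) {
        return minus(a, minus(b, c)).cardinality();
    }

    public static void main(String[] args) {
        BitSet a = range('a', 'z');
        BitSet b = range('a', 'm');
        BitSet c = range('a', 'f');
        System.out.println(leftToRight(a, b, c)); // 13: only n-z survive
        System.out.println(rightFirst(a, b, c));  // 19: a-f are kept as well
    }
}
```

The two groupings give different sets, which is why the pattern syntax has to fix one of them.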

`[a]`	The set containing 'a'
`[a-z]`	The set containing 'a' through 'z' and all letters in between, in Unicode order
`[^a-z]`	The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
`[[pat1][pat2]]`	The union of sets specified by pat1 and pat2
`[[pat1]&[pat2]]`	The intersection of sets specified by pat1 and pat2
`[[pat1]-[pat2]]`	The asymmetric difference of sets specified by pat1 and pat2
`[:Lu:]` or `\p{Lu}`	The set of characters having the specified Unicode property; in this case, Unicode uppercase letters
`[:^Lu:]` or `\P{Lu}`	The set of characters not having the given Unicode property
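The `[^a-z]` row describes a complement as two ranges. One compact way to picture that is an inversion list, a sorted array of range boundaries (the test code further down mentions a "Ported InversionList test"). The following is a rough sketch in plain Java under that assumption; `InversionListSketch` and its methods are hypothetical names and are not taken from ICU.

```java
import java.util.Arrays;

class InversionListSketch {
    static final int LIMIT = 0x110000; // one past U+10FFFF

    // A list holds boundaries {start0, limit0, start1, limit1, ...};
    // starts are inclusive, limits exclusive, values strictly increasing.
    static int[] complement(int[] list) {
        boolean startsAtZero = list.length > 0 && list[0] == 0;
        boolean endsAtLimit = list.length > 0 && list[list.length - 1] == LIMIT;
        int[] out = new int[list.length + (startsAtZero ? -1 : 1) + (endsAtLimit ? -1 : 1)];
        int n = 0;
        if (!startsAtZero) out[n++] = 0;      // a new range now begins at U+0000
        int from = startsAtZero ? 1 : 0;
        int to = endsAtLimit ? list.length - 1 : list.length;
        for (int i = from; i < to; i++) out[n++] = list[i];
        if (!endsAtLimit) out[n++] = LIMIT;   // the last range now runs to U+10FFFF
        return out;
    }

    // Number of code points covered by the list.
    static int size(int[] list) {
        int total = 0;
        for (int i = 0; i < list.length; i += 2) total += list[i + 1] - list[i];
        return total;
    }

    // A code point is in the set when an odd number of boundaries are <= cp.
    static boolean contains(int[] list, int cp) {
        int count = 0;
        for (int boundary : list) if (boundary <= cp) count++;
        return count % 2 == 1;
    }

    public static void main(String[] args) {
        int[] lower = {'a', 'z' + 1};                      // [a-z]
        int[] notLower = complement(lower);                // [^a-z]
        System.out.println(Arrays.toString(notLower));     // [0, 97, 123, 1114112]
        System.out.println(size(notLower) == LIMIT - 26);  // true
        System.out.println(size(complement(new int[0])) == LIMIT); // true
    }
}
```

In this model a complement only touches the two ends of the array, so its cost does not depend on how many code points the set covers.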

Warning: you cannot add an empty string ("") to a UnicodeSet.

Formal syntax

pattern := ('[' '^'? item* ']') | property

item := char | (char '-' char) | pattern-expr

pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern

op := '&' | '-'

special := '[' | ']' | '-'

char := any character that is not special | ('\\' any character) | ('\u' hex hex hex hex)

hex := any character for which Character.digit(c, 16) returns a non-negative result

property := a Unicode property set pattern

Legend:

a := b a may be replaced by b

a? zero or one instance of a

a* zero or more instances of a

a | b either a or b

'a' the literal string between the quotes
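The grammar above maps fairly directly onto a recursive-descent parser. The following is a simplified sketch in plain Java, not the ICU implementation: it assumes a `java.util.BitSet` result, treats each UTF-16 unit as one character, and leaves out property patterns, multi-character strings, and whitespace skipping. An operator is applied to everything accumulated so far inside the enclosing brackets. The class name `SetPatternSketch` and its methods are hypothetical.

```java
import java.util.BitSet;

class SetPatternSketch {
    private final String s;
    private int pos;

    private SetPatternSketch(String s) { this.s = s; }

    // Parses a whole pattern and returns its code points as a BitSet.
    static BitSet parse(String pattern) {
        SetPatternSketch p = new SetPatternSketch(pattern);
        BitSet result = p.pattern();
        if (p.pos != pattern.length()) throw p.error("unexpected trailing text");
        return result;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at index " + pos + " in " + s);
    }

    // Returns the char at pos + offset, or -1 past the end of the input.
    private int peek(int offset) {
        return pos + offset < s.length() ? s.charAt(pos + offset) : -1;
    }

    // pattern := '[' '^'? item* ']'   (the property alternative is omitted)
    private BitSet pattern() {
        if (peek(0) != '[') throw error("expected '['");
        pos++;
        boolean negate = peek(0) == '^';
        if (negate) pos++;
        BitSet set = new BitSet();
        boolean first = true;
        while (peek(0) != ']') {
            int c = peek(0);
            if (c == '[') {
                set.or(pattern());                    // concatenation means union
            } else if (!first && (c == '&' || c == '-') && peek(1) == '[') {
                pos++;                                // op := '&' | '-'
                BitSet rhs = pattern();
                if (c == '&') set.and(rhs); else set.andNot(rhs);
            } else if (c == '-') {
                if (!first && peek(1) != ']') throw error("misplaced '-'");
                set.set('-');                         // leading or trailing '-' is literal
                pos++;
            } else {
                int lo = character();
                if (peek(0) == '-' && peek(1) != ']' && peek(1) != '[') {
                    pos++;
                    int hi = character();
                    if (lo >= hi) throw error("range start must be below range end");
                    set.set(lo, hi + 1);
                } else {
                    set.set(lo);
                }
            }
            first = false;
        }
        pos++;                                        // consume ']'
        if (negate) set.flip(0, 0x110000);
        return set;
    }

    // char := non-special char | backslash + any char | backslash + 'u' + 4 hex digits
    private int character() {
        if (pos >= s.length()) throw error("unexpected end of pattern");
        char c = s.charAt(pos++);
        if (c != '\\') return c;
        if (pos >= s.length()) throw error("dangling backslash");
        char escaped = s.charAt(pos++);
        if (escaped != 'u') return escaped;
        int cp = 0;
        for (int i = 0; i < 4; i++) {
            int digit = pos < s.length() ? Character.digit(s.charAt(pos), 16) : -1;
            if (digit < 0) throw error("expected a hex digit");
            cp = cp * 16 + digit;
            pos++;
        }
        return cp;
    }

    public static void main(String[] args) {
        System.out.println(parse("[a-z]").cardinality());            // 26
        System.out.println(parse("[[a-z]-[aeiou]]").cardinality());  // 21
        System.out.println(parse("[-ab]").equals(parse("[ab-]")));   // true
    }
}
```

Each grammar rule becomes one method, and the left-to-right binding of '&' and '-' falls out of applying each operator to the set built so far.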

To iterate over the contents of a UnicodeSet, use the UnicodeSetIterator class.

Author: Alan Liu. Stable since ICU 2.0. See also: UnicodeSetIterator.

    // ===== PRIVATES =====


    private int processSet(String regex, int i, StringBuffer result, UnicodeSet temp, ParsePosition pos) {
        try {
            pos.setIndex(i);
            UnicodeSet x = temp.clear().applyPattern(regex, pos, null, 0);
            x.complement().complement(); // hack to fix toPattern
            result.append(x.toPattern(false));
            i = pos.getIndex() - 1; // allow for the loop increment
            return i;
        } catch (Exception e) {
            // Wrap any parse failure, keeping the original exception as the cause.
            throw new IllegalArgumentException("Error in " + regex, e);
        }
    }


    /**
     * @param string
     * @return
     */
    public String quoteLiteral(String string) {
        if (needingQuoteCharacters == null) {
            needingQuoteCharacters = new UnicodeSet().addAll(syntaxCharacters).addAll(ignorableCharacters).addAll(extraQuotingCharacters); // .addAll(quoteCharacters)
            if (usingSlash) needingQuoteCharacters.add(BACK_SLASH);
            if (usingQuote) needingQuoteCharacters.add(SINGLE_QUOTE);
        }
        StringBuffer result = new StringBuffer();
        int quotedChar = NO_QUOTE;


    // This is pretty thoroughly tested by checkCanonicalRep()
    // run against the exhaustive operation results.  Use the code
    // here for debugging specific spot problems.
    
    // 1 overlap against 2
    UnicodeSet set = new UnicodeSet("[h-km-q]");
    UnicodeSet set2 = new UnicodeSet("[i-o]");
    set.addAll(set2);
    expectPairs(set, "hq");
    // right
    set.applyPattern("[a-m]");
    set2.applyPattern("[e-o]");
    set.addAll(set2);
    expectPairs(set, "ao");
    // left
    set.applyPattern("[e-o]");
    set2.applyPattern("[a-m]");
    set.addAll(set2);
    expectPairs(set, "ao");
    // 1 overlap against 3
    set.applyPattern("[a-eg-mo-w]");
    set2.applyPattern("[d-q]");
    set.addAll(set2);
    expectPairs(set, "aw");
  }
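The overlap cases in the snippet above can be reproduced without ICU by merging sorted ranges. This is a small sketch under stated assumptions: ranges are modelled as inclusive `{first, last}` pairs, and `expectPairs` is assumed to compare against the endpoints of each resulting range written as a string. `RangeUnionSketch` and `unionAsPairs` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeUnionSketch {
    // Each range is {first, last}, both inclusive. Returns the merged ranges
    // as a string of endpoint pairs, for example "hq" for the single range h-q.
    static String unionAsPairs(int[][] a, int[][] b) {
        List<int[]> all = new ArrayList<>(Arrays.asList(a));
        all.addAll(Arrays.asList(b));
        all.sort((x, y) -> Integer.compare(x[0], y[0])); // order by range start
        StringBuilder pairs = new StringBuilder();
        int first = 0, last = 0;
        boolean open = false;
        for (int[] r : all) {
            if (open && r[0] <= last + 1) {
                last = Math.max(last, r[1]);   // overlapping or adjacent: extend
            } else {
                if (open) pairs.appendCodePoint(first).appendCodePoint(last);
                first = r[0];
                last = r[1];
                open = true;
            }
        }
        if (open) pairs.appendCodePoint(first).appendCodePoint(last);
        return pairs.toString();
    }

    public static void main(String[] args) {
        int[][] set = {{'h', 'k'}, {'m', 'q'}};       // [h-km-q]
        int[][] set2 = {{'i', 'o'}};                  // [i-o]
        System.out.println(unionAsPairs(set, set2));  // hq
    }
}
```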

  public void TestAPI() {
    // default ct
    UnicodeSet set = new UnicodeSet();
    if (!set.isEmpty() || set.getRangeCount() != 0) {
      errln("FAIL, set should be empty but isn't: " +
          set);
    }
    
    // clear(), isEmpty()
    set.add('a');
    if (set.isEmpty()) {
      errln("FAIL, set shouldn't be empty but is: " +
          set);
    }
    set.clear();
    if (!set.isEmpty()) {
      errln("FAIL, set should be empty but isn't: " +
          set);
    }
    
    // size()
    set.clear();
    if (set.size() != 0) {
      errln("FAIL, size should be 0, but is " + set.size() +
          ": " + set);
    }
    set.add('a');
    if (set.size() != 1) {
      errln("FAIL, size should be 1, but is " + set.size() +
          ": " + set);
    }
    set.add('1', '9');
    if (set.size() != 10) {
      errln("FAIL, size should be 10, but is " + set.size() +
          ": " + set);
    }
    set.clear();
    set.complement();
    if (set.size() != 0x110000) {
      errln("FAIL, size should be 0x110000, but is " + set.size());
    }
    
    // contains(first, last)
    set.clear();
    set.applyPattern("[A-Y 1-8 b-d l-y]");
    for (int i = 0; i<set.getRangeCount(); ++i) {
      int a = set.getRangeStart(i);
      int b = set.getRangeEnd(i);
      if (!set.contains(a, b)) {
        errln("FAIL, should contain " + (char)a + '-' + (char)b +
            " but doesn't: " + set);
      }
      if (set.contains((char)(a-1), b)) {
        errln("FAIL, shouldn't contain " +
            (char)(a-1) + '-' + (char)b +
            " but does: " + set);
      }
      if (set.contains(a, (char)(b+1))) {
        errln("FAIL, shouldn't contain " +
            (char)a + '-' + (char)(b+1) +
            " but does: " + set);
      }
    }
    
    // Ported InversionList test.
    UnicodeSet a = new UnicodeSet((char)3,(char)10);
    UnicodeSet b = new UnicodeSet((char)7,(char)15);
    UnicodeSet c = new UnicodeSet();
    
    logln("a [3-10]: " + a);
    logln("b [7-15]: " + b);
    c.set(a); c.addAll(b);
    UnicodeSet exp = new UnicodeSet((char)3,(char)15);
    if (c.equals(exp)) {
      logln("c.set(a).add(b): " + c);
    } else {
      errln("FAIL: c.set(a).add(b) = " + c + ", expect " + exp);
    }
    c.complement();
    exp.set((char)0, (char)2);
    exp.add((char)16, UnicodeSet.MAX_VALUE);
    if (c.equals(exp)) {
      logln("c.complement(): " + c);
    } else {
      errln(Utility.escape("FAIL: c.complement() = " + c + ", expect " + exp));
    }
    c.complement();
    exp.set((char)3, (char)15);
    if (c.equals(exp)) {
      logln("c.complement(): " + c);
    } else {
      errln("FAIL: c.complement() = " + c + ", expect " + exp);
    }
    c.set(a); c.complementAll(b);
    exp.set((char)3,(char)6);
    exp.add((char)11,(char) 15);
    if (c.equals(exp)) {
      logln("c.set(a).complement(b): " + c);
    } else {
      errln("FAIL: c.set(a).complement(b) = " + c + ", expect " + exp);
    }
    
    exp.set(c);
    c = bitsToSet(setToBits(c));
    if (c.equals(exp)) {
      logln("bitsToSet(setToBits(c)): " + c);
    } else {
      errln("FAIL: bitsToSet(setToBits(c)) = " + c + ", expect " + exp);
    } 
    
    // Additional tests for coverage JB#2118
    //UnicodeSet::complement(class UnicodeString const &)
    //UnicodeSet::complementAll(class UnicodeString const &)
    //UnicodeSet::containsNone(class UnicodeSet const &)
    //UnicodeSet::containsNone(long,long)
    //UnicodeSet::containsSome(class UnicodeSet const &)
    //UnicodeSet::containsSome(long,long)
    //UnicodeSet::removeAll(class UnicodeString const &)
    //UnicodeSet::retain(long)
    //UnicodeSet::retainAll(class UnicodeString const &)
    //UnicodeSet::serialize(unsigned short *,long,enum UErrorCode &)
    //UnicodeSetIterator::getString(void)
    set.clear();
    set.complement("ab");
    exp.applyPattern("[{ab}]");
    if (!set.equals(exp)) { errln("FAIL: complement(\"ab\")"); return; }
    
    UnicodeSetIterator iset = new UnicodeSetIterator(set);
    if (!iset.next() || iset.codepoint != UnicodeSetIterator.IS_STRING) {
      errln("FAIL: UnicodeSetIterator.next/IS_STRING");
    } else if (!iset.string.equals("ab")) {
      errln("FAIL: UnicodeSetIterator.string");
    }
    
    set.add((char)0x61, (char)0x7A);
    set.complementAll("alan");
    exp.applyPattern("[{ab}b-kmo-z]");
    if (!set.equals(exp)) { errln("FAIL: complementAll(\"alan\")"); return; }
    
    exp.applyPattern("[a-z]");
    if (set.containsNone(exp)) { errln("FAIL: containsNone(UnicodeSet)"); }
    if (!set.containsSome(exp)) { errln("FAIL: containsSome(UnicodeSet)"); }
    exp.applyPattern("[aln]");
    if (!set.containsNone(exp)) { errln("FAIL: containsNone(UnicodeSet)"); }
    if (set.containsSome(exp)) { errln("FAIL: containsSome(UnicodeSet)"); }
    
    if (set.containsNone((char)0x61, (char)0x7A)) {
      errln("FAIL: containsNone(char, char)");
    }
    if (!set.containsSome((char)0x61, (char)0x7A)) {
      errln("FAIL: containsSome(char, char)");
    }
    if (!set.containsNone((char)0x41, (char)0x5A)) {
      errln("FAIL: containsNone(char, char)");
    }
    if (set.containsSome((char)0x41, (char)0x5A)) {
      errln("FAIL: containsSome(char, char)");
    }
    
    set.removeAll("liu");
    exp.applyPattern("[{ab}b-hj-kmo-tv-z]");
    if (!set.equals(exp)) { errln("FAIL: removeAll(\"liu\")"); return; }
    
    set.retainAll("star");
    exp.applyPattern("[rst]");
    if (!set.equals(exp)) { errln("FAIL: retainAll(\"star\")"); return; }
    
    set.retain((char)0x73);
    exp.applyPattern("[s]");
    if (!set.equals(exp)) { errln("FAIL: retain('s')"); return; }
    
    // ICU 2.6 coverage tests
    // public final UnicodeSet retain(String s);
    // public final UnicodeSet remove(int c);
    // public final UnicodeSet remove(String s);
    // public int hashCode();
    set.applyPattern("[a-z{ab}{cd}]");
    set.retain("cd");
    exp.applyPattern("[{cd}]");
    if (!set.equals(exp)) { errln("FAIL: retain(\"cd\")"); return; }
    
    set.applyPattern("[a-z{ab}{cd}]");
    set.remove((char)0x63);
    exp.applyPattern("[abd-z{ab}{cd}]");
    if (!set.equals(exp)) { errln("FAIL: remove('c')"); return; }
    
    set.remove("cd");
    exp.applyPattern("[abd-z{ab}]");
    if (!set.equals(exp)) { errln("FAIL: remove(\"cd\")"); return; }
    
    if (set.hashCode() != exp.hashCode()) {
      errln("FAIL: hashCode() unequal");
    }
    exp.clear();
    if (set.hashCode() == exp.hashCode()) {
      errln("FAIL: hashCode() equal");
    }
    
    {
      //Cover addAll(Collection) and addAllTo(Collection)
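The `complementAll` expectations in TestAPI above behave like a symmetric difference: every element of the argument has its membership flipped. The sketch below models this with `java.util.BitSet.xor` as an assumption about the semantics, not as ICU's implementation. It covers code points only and ignores multi-character strings such as `{ab}`; all names are hypothetical.

```java
import java.util.BitSet;

class ComplementAllSketch {
    // Hypothetical helper: a BitSet holding every code point in [lo, hi].
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet();
        s.set(lo, hi + 1);
        return s;
    }

    // Hypothetical helper: the set of distinct chars in a string.
    static BitSet chars(String text) {
        BitSet s = new BitSet();
        for (char ch : text.toCharArray()) s.set(ch);
        return s;
    }

    // Models c.set(a); c.complementAll(b) without modifying a or b.
    static BitSet complementAll(BitSet a, BitSet b) {
        BitSet c = (BitSet) a.clone();
        c.xor(b);
        return c;
    }

    public static void main(String[] args) {
        // [3-10] against [7-15] leaves [3-6] and [11-15].
        System.out.println(complementAll(range(3, 10), range(7, 15)));
        // prints {3, 4, 5, 6, 11, 12, 13, 14, 15}

        // [a-z] against the chars of "alan" drops a, l and n.
        System.out.println(complementAll(range('a', 'z'), chars("alan")).cardinality()); // 23
    }
}
```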


//  expectRelation(testList[i][0], testList[i][1], testList[i][2], "(" + i + ")");
//  }        
    
    UnicodeSet[][] testList = {
        {UnicodeSet.fromAll("abc"),
          new UnicodeSet("[a-c]")},
          
          {UnicodeSet.from("ch").add('a','z').add("ll"),
            new UnicodeSet("[{ll}{ch}a-z]")},
            
            {UnicodeSet.from("ab}c"),  
              new UnicodeSet("[{ab\\}c}]")},
              
              {new UnicodeSet('a','z').add('A', 'Z').retain('M','m').complement('X'), 
                new UnicodeSet("[[a-zA-Z]&[M-m]-[X]]")},
    };
    
    for (int i = 0; i < testList.length; ++i) {
      if (!testList[i][0].equals(testList[i][1])) {
        errln("FAIL: sets unequal; see source code (" + i + ")");


      expectContainment(DATA[i], DATA[i+1], DATA[i+2]);
    }
  }
  
  public void TestUnicodeSetStrings() {
    UnicodeSet uset = new UnicodeSet("[a{bc}{cd}pqr\u0000]");
    logln(uset + " ~ " + uset.getRegexEquivalent());
    String[][] testStrings = {{"x", "none"},
        {"bc", "all"},
        {"cdbca", "all"},
        {"a", "all"},
        {"bcx", "some"},


  
  /**
   * Test cloning of UnicodeSet
   */
  public void TestClone() {
    UnicodeSet s = new UnicodeSet("[abcxyz]");
    UnicodeSet t = (UnicodeSet) s.clone();
    expectContainment(t, "abc", "def");
  }


  
  /**
   * Test the indexOf() and charAt() methods.
   */
  public void TestIndexOf() {
    UnicodeSet set = new UnicodeSet("[a-cx-y3578]");
    for (int i=0; i<set.size(); ++i) {
      int c = set.charAt(i);
      if (set.indexOf(c) != i) {
        errln("FAIL: charAt(" + i + ") = " + c +
            " => indexOf() => " + set.indexOf(c));
      }
    }
    int c = set.charAt(set.size());
    if (c != -1) {
      errln("FAIL: charAt(<out of range>) = " +
          Utility.escape(String.valueOf(c)));
    }
    int j = set.indexOf('q');
    if (j != -1) {
      errln("FAIL: indexOf('q') = " + j);
    }
  }
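The round trip that TestIndexOf checks can be pictured as walking the set's ranges in code point order. Below is a minimal sketch, assuming the set is stored as sorted, non-overlapping inclusive ranges and that out-of-range lookups return -1 as the test expects. `IndexSketch` and its methods are hypothetical and only mirror the names used in the test.

```java
class IndexSketch {
    // Ranges are sorted, non-overlapping {first, last} pairs, both inclusive.
    // Returns the code point at the given position, or -1 if out of range.
    static int charAt(int[][] ranges, int index) {
        if (index < 0) return -1;
        for (int[] r : ranges) {
            int length = r[1] - r[0] + 1;
            if (index < length) return r[0] + index;
            index -= length;                    // skip this range entirely
        }
        return -1;
    }

    // Returns the position of the code point in the set, or -1 if absent.
    static int indexOf(int[][] ranges, int cp) {
        int before = 0;                         // code points in earlier ranges
        for (int[] r : ranges) {
            if (cp < r[0]) return -1;           // ranges are sorted, so cp is absent
            if (cp <= r[1]) return before + cp - r[0];
            before += r[1] - r[0] + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // [a-cx-y3578] written out in code point order.
        int[][] set = {{'3', '3'}, {'5', '5'}, {'7', '8'}, {'a', 'c'}, {'x', 'y'}};
        System.out.println((char) charAt(set, 0)); // 3
        System.out.println(indexOf(set, 'a'));     // 4
        System.out.println(charAt(set, 9));        // -1
        System.out.println(indexOf(set, 'q'));     // -1
    }
}
```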

  public void TestContainsString() {
    UnicodeSet x = new UnicodeSet("[a{bc}]");
    if (x.contains("abc")) errln("FAIL");
  }


        String pat = "";
        try {
          String name =
            (j==0) ? UScript.getName(i) : UScript.getShortName(i);
          pat = "[:" + name + ":]";
          UnicodeSet set = new UnicodeSet(pat);
          logln("Ok: " + pat + " -> " + set.toPattern(false));
        } catch (IllegalArgumentException e) {
          if (pat.length() == 0) {
            errln("FAIL (in UScript): No name for script " + i);
          } else {
            errln("FAIL: Couldn't create " + pat);

