Examples of com.ibm.icu.text.UnicodeSet

com.ibm.icu.text.UnicodeSet

http://www.icu-project.org/userguide/unicodeSet.html. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.

Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or as the last character before the closing ']', it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
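The literal-dash rule can be illustrated with a small toy parser. This is only a sketch in plain Java: the class `DashLiteralDemo` and its `parse` method are hypothetical names, it handles only flat patterns (no '^', nesting, or properties), and it is not how ICU itself parses patterns.

```java
import java.util.TreeSet;

/** Toy parser for a flat bracket pattern. Hypothetical helper, not ICU code. */
final class DashLiteralDemo {

    /** Parses "[...]" holding only literals, escapes, and ranges. */
    static TreeSet<Integer> parse(String pattern) {
        int n = pattern.length();
        if (n < 2 || pattern.charAt(0) != '[' || pattern.charAt(n - 1) != ']') {
            throw new IllegalArgumentException("expected [...]: " + pattern);
        }
        String body = pattern.substring(1, n - 1);
        TreeSet<Integer> set = new TreeSet<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\') {
                // A backslash makes the next character a literal.
                if (i + 1 >= body.length()) {
                    throw new IllegalArgumentException("dangling escape");
                }
                set.add((int) body.charAt(i + 1));
                i += 2;
            } else if (c == '-') {
                // An unescaped '-' is a literal only in first or last position.
                if (i != 0 && i != body.length() - 1) {
                    throw new IllegalArgumentException("misplaced '-'");
                }
                set.add((int) c);
                i++;
            } else if (i + 2 < body.length() && body.charAt(i + 1) == '-') {
                // c '-' end: a range, which must be strictly ascending.
                char end = body.charAt(i + 2);
                if (c >= end) {
                    throw new IllegalArgumentException("bad range: " + c + "-" + end);
                }
                for (int cp = c; cp <= end; cp++) {
                    set.add(cp);
                }
                i += 3;
            } else {
                set.add((int) c);
                i++;
            }
        }
        return set;
    }

    public static void main(String[] args) {
        // All three spellings should yield the same three code points.
        System.out.println(parse("[a\\-b]"));
        System.out.println(parse("[-ab]"));
        System.out.println(parse("[ab-]"));
    }
}
```

In this sketch a descending or single-character range such as "[z-a]" is rejected, mirroring the syntax-error rule stated above.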

Sets may be intersected using the '&' operator, and the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. The operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for difference; intersection is commutative.
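Why left-to-right binding matters for difference can be seen with a small model. The sketch below uses `java.util.BitSet` as a stand-in for a set of code points; `SetOpsDemo` and its helpers are hypothetical names chosen for illustration and are not part of the UnicodeSet API.

```java
import java.util.BitSet;

/** Models code point sets with BitSet to compare groupings of set difference. */
final class SetOpsDemo {

    /** Returns the set of all code points in start..end, inclusive. */
    static BitSet range(int start, int end) {
        BitSet b = new BitSet();
        b.set(start, end + 1);
        return b;
    }

    /** Asymmetric difference: elements of a that are not in b. */
    static BitSet difference(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    /** Intersection: elements present in both a and b. */
    static BitSet intersection(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    public static void main(String[] args) {
        BitSet a = range('a', 'z');
        BitSet b = range('a', 'm');
        BitSet c = range('a', 'f');

        // Left-to-right grouping, as the pattern syntax specifies: (a - b) - c
        BitSet leftToRight = difference(difference(a, b), c);
        // The other grouping gives a different set: a - (b - c)
        BitSet rightToLeft = difference(a, difference(b, c));

        System.out.println(leftToRight.cardinality()); // n..z
        System.out.println(rightToLeft.cardinality()); // a..f plus n..z
        System.out.println(intersection(a, b).equals(intersection(b, a)));
    }
}
```

With these inputs the first grouping leaves only 'n' through 'z', while the second also keeps 'a' through 'f', because 'a' through 'f' were removed from the subtrahend before it was applied.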

`[a]`	The set containing 'a'
`[a-z]`	The set containing 'a' through 'z' and all letters in between, in Unicode order
`[^a-z]`	The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
`[[pat1][pat2]]`	The union of sets specified by pat1 and pat2
`[[pat1]&[pat2]]`	The intersection of sets specified by pat1 and pat2
`[[pat1]-[pat2]]`	The asymmetric difference of sets specified by pat1 and pat2
`[:Lu:] or \p{Lu}`	The set of characters having the specified Unicode property; in this case, Unicode uppercase letters
`[:^Lu:] or \P{Lu}`	The set of characters not having the given Unicode property
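The last two rows describe a property set and its complement, which together cover every code point. A rough way to see this without ICU is to use the JDK's own `Character.getType`; note that this is an assumption-laden stand-in, since the JDK's Unicode data version may differ from the one UCharacter uses, and `UppercasePropertyDemo` is a hypothetical class name.

```java
/** Uses the JDK's Unicode data to mimic membership in [:Lu:] and [:^Lu:]. */
final class UppercasePropertyDemo {

    /** True if the JDK assigns General_Category=Lu (uppercase letter) to cp. */
    static boolean isLu(int cp) {
        return Character.getType(cp) == Character.UPPERCASE_LETTER;
    }

    /** Counts code points in U+0000..U+10FFFF for which isLu(cp) == wanted. */
    static int count(boolean wanted) {
        int n = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isLu(cp) == wanted) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The two counts always add up to 0x110000, the size of the code space.
        System.out.println("Lu: " + count(true) + ", not Lu: " + count(false));
    }
}
```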

Warning: you cannot add an empty string ("") to a UnicodeSet.

Formal syntax

pattern := ('[' '^'? item* ']') | property

item := char | (char '-' char) | pattern-expr

pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern

op := '&' | '-'

special := '[' | ']' | '-'

char := any character that is not special | ('\\' any character) | ('\u' hex hex hex hex)

hex := any character for which Character.digit(c, 16) returns a non-negative result

property := a Unicode property set pattern

Legend:

a := b a may be replaced by b

a? zero or one instance of a

a* zero or more instances of a

a | b either a or b

'a' the literal string between the quotes

To iterate over the contents of a UnicodeSet, use the UnicodeSetIterator class.

Author: Alan Liu. Stable since ICU 2.0. See also: UnicodeSetIterator.
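Several of the snippets on this page walk a set range by range rather than one code point at a time. The idea can be sketched in plain Java with a `BitSet` standing in for the set; `RangeIterationDemo` and `ranges` are hypothetical names, and this is an analogy to range iteration, not the UnicodeSetIterator API itself.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Walks a BitSet as a list of contiguous [start, end] ranges. */
final class RangeIterationDemo {

    /** Returns each maximal run of set bits as an inclusive {start, end} pair. */
    static List<int[]> ranges(BitSet set) {
        List<int[]> out = new ArrayList<>();
        int start = set.nextSetBit(0);
        while (start >= 0) {
            // The run ends just before the next clear bit.
            int end = set.nextClearBit(start) - 1;
            out.add(new int[] {start, end});
            start = set.nextSetBit(end + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet set = new BitSet();
        set.set('a', 'c' + 1);
        set.set('x', 'z' + 1);
        for (int[] r : ranges(set)) {
            System.out.println((char) r[0] + ".." + (char) r[1]);
        }
    }
}
```

Visiting ranges instead of single code points keeps the loop short for large contiguous sets such as whole scripts.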

                             "StringTokenizer", "constructors!"};
        StringTokenizer defaultst = new StringTokenizer(str);
        StringTokenizer stdelimiter = new StringTokenizer(str, delimiter);
        StringTokenizer stdelimiterreturn = new StringTokenizer(str, delimiter,
                                                                false);
        UnicodeSet delimiterset = new UnicodeSet("[" + delimiter + "]", false);
        StringTokenizer stdelimiterset = new StringTokenizer(str, delimiterset);
        StringTokenizer stdelimitersetreturn = new StringTokenizer(str, 
                                                                delimiterset,
                                                                false);
        for (int i = 0; i < expected.length; i ++) {
            if (!(defaultst.nextElement().equals(expected[i]) 
                  && stdelimiter.nextElement().equals(expected[i])
                  && stdelimiterreturn.nextElement().equals(expected[i])
                  && stdelimiterset.nextElement().equals(expected[i])
                  && stdelimitersetreturn.nextElement().equals(expected[i]))) {
                errln("Constructor with default delimiter gives wrong results");
            }
        }
        
        String expected1[] = {"this", "\t", "is", "\n", "a", "\r", "string", "\f",
                            "testing", "\t", "StringTokenizer", "\n",
                            "constructors!"};
        stdelimiterreturn = new StringTokenizer(str, delimiter, true);
        stdelimitersetreturn = new StringTokenizer(str, delimiterset, true);
        for (int i = 0; i < expected1.length; i ++) {
            if (!(stdelimiterreturn.nextElement().equals(expected1[i])
                  && stdelimitersetreturn.nextElement().equals(expected1[i]))) {
                errln("Constructor with default delimiter and delimiter tokens gives wrong results");
            }
        }
                            
        stdelimiter = new StringTokenizer(str, (String)null);
        stdelimiterreturn = new StringTokenizer(str, (String)null, false);
        delimiterset = null;
        stdelimiterset = new StringTokenizer(str, delimiterset);
        stdelimitersetreturn = new StringTokenizer(str, delimiterset, false);
        
        if (!(stdelimiter.nextElement().equals(str)
              && stdelimiterreturn.nextElement().equals(str)
              && stdelimiterset.nextElement().equals(str)
              && stdelimitersetreturn.nextElement().equals(str))) {
            errln("Constructor with null delimiter gives wrong results");
        }
        
        delimiter = "";
        stdelimiter = new StringTokenizer(str, delimiter);
        stdelimiterreturn = new StringTokenizer(str, delimiter, false);
        delimiterset = new UnicodeSet();
        stdelimiterset = new StringTokenizer(str, delimiterset);
        stdelimitersetreturn = new StringTokenizer(str, delimiterset, false);
        
        if (!(stdelimiter.nextElement().equals(str)
              && stdelimiterreturn.nextElement().equals(str)


     */
    private static final synchronized UnicodeSet internalGetNXHangul() {
        /* internal function, does not check for incoming U_FAILURE */
    
        if(nxCache[NX_HANGUL]==null) {
             nxCache[NX_HANGUL]=new UnicodeSet(0xac00, 0xd7a3);
        }
        return nxCache[NX_HANGUL];
    }
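The snippet above builds the Hangul syllable range once and then reuses it. A stdlib-only sketch of the same lazy-cache pattern follows, with a `BitSet` in place of UnicodeSet; the class name `HangulCacheDemo`, the one-slot cache, and the slot index are assumptions made for illustration.

```java
import java.util.BitSet;

/** Lazily builds and caches the set U+AC00..U+D7A3 (Hangul syllables). */
final class HangulCacheDemo {

    private static final int NX_HANGUL = 0;            // hypothetical slot index
    private static final BitSet[] cache = new BitSet[1];

    /** Returns the shared set, creating it on first use. */
    static synchronized BitSet hangul() {
        if (cache[NX_HANGUL] == null) {
            BitSet s = new BitSet();
            s.set(0xAC00, 0xD7A3 + 1);   // the upper bound of set() is exclusive
            cache[NX_HANGUL] = s;
        }
        return cache[NX_HANGUL];
    }

    public static void main(String[] args) {
        System.out.println(hangul().cardinality());
    }
}
```

Because the cached object is shared, callers in a design like this should treat it as read-only; mutating it would affect every later caller.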


//printTable();
    }


    public void buildColumnMap(InputStreamReader in) throws IOException {
        System.out.println("Building column map...");
        UnicodeSet charsInFile = new UnicodeSet();
        int c = in.read();
        int totalChars = 0;
        while (c >= 0) {
            ++totalChars;
            if (totalChars > 0 && totalChars % 5000 == 0)
                System.out.println("Read " + totalChars + " characters...");
            if (c > ' ')
                charsInFile.add((char)c);
            c = in.read();
        }
//        Test.debugPrintln(charsInFile.toString());


        StringBuffer tempReverseMap = new StringBuffer();
        tempReverseMap.append(' ');


        columnMap = new CompactByteArray();
        int n = charsInFile.getRangeCount();
        byte p = 1;
        for (int i=0; i<n; ++i) {
            char start = (char) charsInFile.getRangeStart(i);
            char end = (char) charsInFile.getRangeEnd(i);
            for (char ch = start; ch <= end; ch++) {
                if (columnMap.elementAt(Character.toLowerCase(ch)) == 0) {
                    columnMap.setElementAt(Character.toUpperCase(ch), Character.toUpperCase(ch),
                                        p);
                    columnMap.setElementAt(Character.toLowerCase(ch), Character.toLowerCase(ch),


        /* internal function, does not check for incoming U_FAILURE */
    
        if(nxCache[NX_CJK_COMPAT]==null) {


            /* build a set from [CJK Ideographs]&[has canonical decomposition] */
            UnicodeSet set, hasDecomp;
    
            set=new UnicodeSet("[:Ideographic:]");
    
            /* start with an empty set for [has canonical decomposition] */
            hasDecomp=new UnicodeSet();
    
            /* iterate over all ideographs and remember which canonically decompose */
            UnicodeSetIterator it = new UnicodeSetIterator(set);
            int start, end;
            long norm32;


            return null;
        }
    
        if(nxCache[options]==null) {
            /* build a set with all code points that were not designated by the specified Unicode version */
            UnicodeSet set = new UnicodeSet();


            switch(options) {
            case Normalizer.UNICODE_3_2:
                set.applyPattern("[:^Age=3.2:]");
                break;
            default:
                return null;
            }


            if((options & OPTIONS_UNICODE_MASK)!=0 && (options & OPTIONS_NX_MASK)==0) {
                return internalGetNXUnicode(options);
            }
    
            /* build a set from multiple subsets */
            UnicodeSet set;
            UnicodeSet other;
    
            set=new UnicodeSet();


    
            if((options & NX_HANGUL)!=0 && null!=(other=internalGetNXHangul())) {
                set.addAll(other);
            }


        for (int script = fMinScript; script <= fMaxScript; script += 1) {
            fScriptNames[script - fMinScript] = UScript.getName(script).toUpperCase();
            fScriptTags[script - fMinScript]  = UScript.getShortName(script).toLowerCase();
            
            if (script != commonScript) {
                UnicodeSet scriptSet  = new UnicodeSet("\\p{" + fScriptTags[script - fMinScript] + "}");
                UnicodeSetIterator it = new UnicodeSetIterator(scriptSet);
            
                while (it.nextRange()) {
                    Record record = new Record(it.codepoint, it.codepointEnd, script);


    // TODO: The UnicodeSet is constrained to the BMP because the ClassTable data structure can
    // only handle 16-bit entries. This is probably OK as long as there aren't any joining scripts
    // outside of the BMP...
    public void buildShapingTypes(String filename)
    {
        UnicodeSet shapingTypes = new UnicodeSet("[[\\P{Joining_Type=Non_Joining}] & [\\u0000-\\uFFFF]]");
        int count = shapingTypes.size();
        
        System.out.println("There are " + count + " characters with a joining type.");
        
        for(int i = 0; i < count; i += 1) {
            int ch = shapingTypes.charAt(i);
            
            classTable.addMapping(ch, UCharacter.getIntPropertyValue(ch, UProperty.JOINING_TYPE));
        }
        
        LigatureModuleWriter writer = new LigatureModuleWriter();


                + ", NameChoice: " + nameChoice + ", "
                + e1.getClass().getName());
            continue;
          }
          logln("Value (" + valueNum + "): " + valueName);
          UnicodeSet testSet;
          try {
            testSet = new UnicodeSet("[:" + propName + "=" + valueName + ":]");
          } catch (RuntimeException e) {
            errln("Can't create UnicodeSet for: "
                + "Property (" + propNum + "): " + propName + ", " 
                + "Value (" + valueNum + "): " + valueName + ", "
                + e.getClass().getName());
            continue;
          }
          UnicodeSet collectedErrors = new UnicodeSet();
          for (UnicodeSetIterator it = new UnicodeSetIterator(testSet); it.next();) {
            int value = UCharacter.getIntPropertyValue(it.codepoint, propNum);
            if (value != valueNum) {
              collectedErrors.add(it.codepoint);
            }
          }
          if (collectedErrors.size() != 0) {
            errln("Property Value Differs: " 
                + "Property (" + propNum + "): " + propName + ", " 
                + "Value (" + valueNum + "): " + valueName + ", "
                + "Differing values: " + collectedErrors.toPattern(true));
          }
        }
      } 
    }
  }


   */
  public void TestToPattern() throws Exception {
    // Test that toPattern() round trips with syntax characters
    // and whitespace.
    for (int i = 0; i < OTHER_TOPATTERN_TESTS.length; ++i) {
      checkPat(OTHER_TOPATTERN_TESTS[i], new UnicodeSet(OTHER_TOPATTERN_TESTS[i]));
    }
    for (int i = 0; i <= 0x10FFFF; ++i) {
      if ((i <= 0xFF && !UCharacter.isLetter(i)) || UCharacter.isWhitespace(i)) {
        // check various combinations to make sure they all work.
        if (i != 0 && !toPatternAux(i, i)) continue;
        if (!toPatternAux(0, i)) continue;
        if (!toPatternAux(i, 0xFFFF)) continue;
      }
    } 
    
    // Test pattern behavior of multicharacter strings.
    UnicodeSet s = new UnicodeSet("[a-z {aa} {ab}]");
    expectToPattern(s, "[a-z{aa}{ab}]",
        new String[] {"aa", "ab", NOT, "ac"});
    s.add("ac");
    expectToPattern(s, "[a-z{aa}{ab}{ac}]",
        new String[] {"aa", "ab", "ac", NOT, "xy"});
    
    s.applyPattern("[a-z {\\{l} {r\\}}]");
    expectToPattern(s, "[a-z{r\\}}{\\{l}]",
        new String[] {"{l", "r}", NOT, "xy"});
    s.add("[]");
    expectToPattern(s, "[a-z{\\[\\]}{r\\}}{\\{l}]",
        new String[] {"{l", "r}", "[]", NOT, "xy"});
    
    s.applyPattern("[a-z {\u4E01\u4E02}{\\n\\r}]");
    expectToPattern(s, "[a-z{\\u000A\\u000D}{\\u4E01\\u4E02}]",
        new String[] {"\u4E01\u4E02", "\n\r"});
    
    s.clear();
    s.add("abc");
    s.add("abc");
    expectToPattern(s, "[{abc}]",
        new String[] {"abc", NOT, "ab"});
    
    // JB#3400: For 2 character ranges prefer [ab] to [a-b]
    s.clear(); 
    s.add('a', 'b');
    expectToPattern(s, "[ab]", null);
    
    // Cover applyPattern, applyPropertyAlias
    s.clear();
    s.applyPattern("[ab ]", true);
    expectToPattern(s, "[ab]", new String[] {"a", NOT, "ab", " "});
    s.clear();
    s.applyPattern("[ab ]", false);
    expectToPattern(s, "[\\ ab]", new String[] {"a", "\u0020", NOT, "ab"});
    
    s.clear();
    s.applyPropertyAlias("nv", "0.5");
    expectToPattern(s, "[\\u00BD\\u0D74\\u0F2A\\u2CFD\\U00010141\\U00010175\\U00010176]", null);
    // Unicode 5.1 adds Malayalam 1/2 (\u0D74)
    
    s.clear();
    s.applyPropertyAlias("gc", "Lu");
    // TODO expectToPattern(s, what?)


    // RemoveAllStrings()
    s.clear();
    s.applyPattern("[a-z{abc}{def}]");
    expectToPattern(s, "[a-z{abc}{def}]", null);
    s.removeAllStrings();
    expectToPattern(s, "[a-z]", null);
  }



Related Classes of com.ibm.icu.text.UnicodeSet

com.ibm.icu.charset.CharsetISCII

com.ibm.icu.charset.CharsetSelector

com.ibm.icu.dev.demo.translit.CaseIterator

com.ibm.icu.dev.demo.translit.TransliterationChart

com.ibm.icu.dev.test.charset.TestConversion

com.ibm.icu.dev.test.charset.TestSelection

com.ibm.icu.dev.test.collator.RandomCollator

com.ibm.icu.dev.test.format.DateTimeGeneratorTest

com.ibm.icu.dev.test.lang.UCharacterCaseTest

com.ibm.icu.dev.test.lang.UCharacterTest

