The SCSU works by using dynamically positioned windows, each consisting of 128 consecutive characters in Unicode. During compression, characters within a window are encoded in the compressed stream as the bytes 0x80 - 0xFF (128 byte values, one per character in the window). The SCSU provides transparency for the characters (bytes) between U+0000 - U+00FF. The SCSU approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
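The window mapping described above can be illustrated with a minimal sketch. It is not the ICU implementation: the class ScsuWindowSketch and its methods are hypothetical names, and it assumes the mapping given in the SCSU specification, where a character at offset n inside the active 128-character window is written as the single byte 0x80 + n. The window start U+0400 (the beginning of the Cyrillic block) is likewise an assumption chosen for the example.

```java
// Illustrative sketch only; names are hypothetical and this is not ICU code.
final class ScsuWindowSketch {
    /** Number of consecutive characters covered by one window. */
    static final int WINDOW_SIZE = 128;

    /**
     * Maps a character to its single-byte value relative to a window.
     * Returns a value in 0x80..0xFF, or -1 if the character lies
     * outside the window starting at windowStart.
     */
    static int encodeInWindow(char c, int windowStart) {
        int offset = c - windowStart;
        if (offset < 0 || offset >= WINDOW_SIZE) {
            return -1; // a real compressor would switch or define a window here
        }
        return 0x80 + offset;
    }

    /** Inverse mapping: a byte value in 0x80..0xFF back to a character. */
    static char decodeFromWindow(int b, int windowStart) {
        if (b < 0x80 || b > 0xFF) {
            throw new IllegalArgumentException("not a window byte: " + b);
        }
        return (char) (windowStart + (b - 0x80));
    }

    public static void main(String[] args) {
        int cyrillicWindow = 0x0400; // assumed window position for this example
        String text = "\u041C\u043E\u0441\u043A\u0432\u0430"; // a Cyrillic word
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X -> 0x%02X%n",
                    (int) c, encodeInWindow(c, cyrillicWindow));
        }
    }
}
```

Under these assumptions each Cyrillic letter costs one byte once the window is selected, which is where the "1 byte per character" behavior for small alphabets comes from. Characters outside every window need extra tag bytes or a different mode, which this sketch does not model.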
USAGE
The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:
    String s = ... ; // get string from somewhere
    byte [] compressed = UnicodeCompressor.compress(s);
The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:
    // Compress an array "chars" of length "len" using a buffer of 512 bytes
    // to the OutputStream "out"

    UnicodeCompressor myCompressor = new UnicodeCompressor();
    final static int BUFSIZE = 512;
    byte [] byteBuffer = new byte [ BUFSIZE ];
    int bytesWritten = 0;
    int [] unicharsRead = new int [1];
    int totalCharsCompressed = 0;
    int totalBytesWritten = 0;

    do {
        // do the compression
        bytesWritten = myCompressor.compress(chars, totalCharsCompressed,
                                             len, unicharsRead,
                                             byteBuffer, 0, BUFSIZE);

        // do something with the current set of bytes
        out.write(byteBuffer, 0, bytesWritten);

        // update the no. of characters compressed
        totalCharsCompressed += unicharsRead[0];

        // update the no. of bytes written
        totalBytesWritten += bytesWritten;

    } while(totalCharsCompressed < len);

    myCompressor.reset(); // reuse compressor

@see UnicodeDecompressor
@author Stephen F. Booth
@stable ICU 2.4