A codec allows a sequence of bytes to be decoded into integer values (or vice versa). It uses a variable-length encoding and a modified sign representation such that small numbers are represented as a single byte, whilst larger numbers take more bytes to encode. The number may be signed or unsigned; if it is signed, it can be weighted towards positive numbers or equally distributed using a one's complement. The codec also supports delta coding, where a sequence of numbers is represented as a series of first-order differences, so a delta encoding of the integers [1..10] would be represented as a sequence of ten 1s. This allows the absolute value of a coded integer to fall outside of the 'small number' range, whilst still being encoded as a single byte. A codec is configured with four parameters:
- B
- The maximum number of bytes that each value is encoded as. B must be in the range [1..5]. For a pass-through coding (where each byte is encoded as itself, aka {@link #BYTE1}), B is 1 (each value takes a maximum of 1 byte).
- H
- The radix of the integer. Values are defined as a sequence of bytes, where byte n is multiplied by H^n. So the number 1234 may be represented as the sequence 4 3 2 1 with a radix (H) of 10. Note that other combinations are also possible, since an element may exceed the radix; 14 2 2 1 will also encode 1234 (14 + 2*10 + 2*100 + 1*1000). The co-parameter L is defined as 256-H. This is important because only the last value in a sequence may be < L; all prior values must be >= L.
- S
- Whether the codec represents signed values (or not). This may have 3 values: 0 (unsigned), 1 (signed, one's complement) or 2 (signed; how this differs from 1 is not yet documented here). TODO: update the documentation once the difference is known.
- D
- Whether the codec represents a delta encoding. This may be 0 (no delta) or 1 (delta encoding). A delta encoding of 1 indicates that values are cumulative; a sequence of 1 1 1 1 1 will represent the sequence 1 2 3 4 5. For this reason, the codec supports two variants of decode; one {@link #decode(InputStream,long) with} and one {@link #decode(InputStream) without} a last parameter. If the codec is a non-delta encoding, then the value is ignored if passed. If the codec is a delta encoding, it is a run-time error to call decode without the extra parameter, and the previously decoded value should be passed in. (It was designed this way to support multi-threaded access without requiring a new instance of the Codec to be cloned for each use.)
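The way B, H and D interact can be illustrated with a minimal sketch. It is not this class's implementation: the class CodecSketch and the helpers decodeUnsigned and decodeDelta are hypothetical names, only the unsigned case (S = 0) is handled, and the running total is kept in a local variable rather than passed in as a last parameter.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration only; not the Codec class described here.
final class CodecSketch {

    // Decodes one unsigned value: byte n contributes x * H^n, and a byte
    // below L = 256 - H (or reaching B bytes) ends the value.
    static long decodeUnsigned(InputStream in, int b, int h) throws IOException {
        int l = 256 - h;
        long value = 0;
        long weight = 1; // H^n for the byte currently being read
        for (int n = 0; n < b; n++) {
            int x = in.read();
            if (x == -1) {
                throw new EOFException("End of stream inside a value");
            }
            value += x * weight;
            weight *= h;
            if (x < l) {
                break; // only the last byte of a value may be < L
            }
        }
        return value;
    }

    // Delta decoding (D = 1): each decoded value is a difference that is
    // added to the previous result.
    static long[] decodeDelta(InputStream in, int count, int b, int h) throws IOException {
        long[] out = new long[count];
        long last = 0;
        for (int i = 0; i < count; i++) {
            last += decodeUnsigned(in, b, h);
            out[i] = last;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 1, 1, 1, 1});
        // With B = 5 and H = 64, five single-byte deltas of 1 decode to 1..5.
        System.out.println(Arrays.toString(decodeDelta(in, 5, 5, 64)));
    }
}
```

With B = 5 and H = 64 (so L = 192), the two bytes 200 1 decode to 200 + 1*64 = 264, because 200 is not below L and so the value continues into a second byte.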
Codecs are notated as (B,H,S,D) and either D or S,D may be omitted if zero. Thus {@link #BYTE1} is denoted (1,256,0,0) or (1,256). The {@link #toString()} method prints out the condensed form of the encoding. Often, the last character in the name ({@link #BYTE1}, {@link #UNSIGNED5}) gives a clue as to the B value. Those that start with U ({@link #UDELTA5}, {@link #UNSIGNED5}) are unsigned; otherwise, in most cases, they are signed. The presence of the word Delta ({@link #DELTA5}, {@link #UDELTA5}) indicates a delta encoding is used.

This codec is really quite cool for storing compressed information, and could be used entirely separately from the Pack200 implementation for efficient transfer of integer data if required. Note that all information is byte-oriented; for decoding float/double information, the bit values are converted (not cast) into a long type. Note that long values are used throughout even though most may be cast to ints; this is primarily to avoid having to worry about signed values, even though using ints would be more efficient.

There are a number of standard codecs ({@link #UDELTA5}, {@link #UNSIGNED5}, {@link #BYTE1}, {@link #CHAR3}) that are used in the implementation of many bands; but there are a variety of other ones, and indeed the specification assumes that other combinations of values can result in more specific and efficient formats. There is also a sequence of canonical encodings defined by the Pack200 specification, which allow a codec to be referred to by canonical number. TODO: add links to canonical numbers when this has been done.
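As an illustration of the (B,H,S,D) notation, the omission rule for zero-valued trailing parameters might be sketched as follows. The class CodecNotation and the method condensed are hypothetical names for this example; the actual {@link #toString()} output may differ in detail.

```java
// Hypothetical illustration only; not the toString() of the class described here.
final class CodecNotation {

    // Builds the condensed form: D is dropped when it is zero, and S is
    // dropped as well when both S and D are zero.
    static String condensed(int b, int h, int s, int d) {
        StringBuilder sb = new StringBuilder();
        sb.append('(').append(b).append(',').append(h);
        if (s != 0 || d != 0) {
            sb.append(',').append(s); // S must be kept whenever D is written
        }
        if (d != 0) {
            sb.append(',').append(d);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(condensed(1, 256, 0, 0)); // prints (1,256)
        System.out.println(condensed(5, 64, 1, 0));  // prints (5,64,1)
        System.out.println(condensed(5, 64, 0, 1));  // prints (5,64,0,1)
    }
}
```

Note that S cannot be dropped on its own: a codec with S = 0 and D = 1 still has to be written with all four parameters, otherwise the third position would be read as S.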
@author Alex Blewitt
@version $Revision: $