Lucene 4.0 DocValues format.
Files:
- .dv.cfs: {@link CompoundFileDirectory compound container}
- .dv.cfe: {@link CompoundFileDirectory compound entries}
Entries within the compound file:
- <segment>_<fieldNumber>.dat: data values
- <segment>_<fieldNumber>.idx: index into the .dat for DEREF types
There are several types of {@code DocValues} with different encodings. From the perspective of filenames, all types store their values in .dat entries within the compound file. In the case of dereferenced/sorted types, the .dat actually contains only the unique values, and an additional .idx file contains pointers to these unique values.
Formats:
- {@code VAR_INTS} .dat --> Header, PackedType, MinValue, DefaultValue, PackedStream
- {@code FIXED_INTS_8} .dat --> Header, ValueSize, {@link DataOutput#writeByte Byte}^maxdoc
- {@code FIXED_INTS_16} .dat --> Header, ValueSize, {@link DataOutput#writeShort Short}^maxdoc
- {@code FIXED_INTS_32} .dat --> Header, ValueSize, {@link DataOutput#writeInt Int32}^maxdoc
- {@code FIXED_INTS_64} .dat --> Header, ValueSize, {@link DataOutput#writeLong Int64}^maxdoc
- {@code FLOAT_32} .dat --> Header, ValueSize, Float32^maxdoc
- {@code FLOAT_64} .dat --> Header, ValueSize, Float64^maxdoc
- {@code BYTES_FIXED_STRAIGHT} .dat --> Header, ValueSize, ({@link DataOutput#writeByte Byte} * ValueSize)^maxdoc
- {@code BYTES_VAR_STRAIGHT} .idx --> Header, TotalBytes, Addresses
- {@code BYTES_VAR_STRAIGHT} .dat --> Header, ({@link DataOutput#writeByte Byte} * variable ValueSize)^maxdoc
- {@code BYTES_FIXED_DEREF} .idx --> Header, NumValues, Addresses
- {@code BYTES_FIXED_DEREF} .dat --> Header, ValueSize, ({@link DataOutput#writeByte Byte} * ValueSize)^NumValues
- {@code BYTES_VAR_DEREF} .idx --> Header, TotalVarBytes, Addresses
- {@code BYTES_VAR_DEREF} .dat --> Header, (LengthPrefix + {@link DataOutput#writeByte Byte} * variable ValueSize)^NumValues
- {@code BYTES_FIXED_SORTED} .idx --> Header, NumValues, Ordinals
- {@code BYTES_FIXED_SORTED} .dat --> Header, ValueSize, ({@link DataOutput#writeByte Byte} * ValueSize)^NumValues
- {@code BYTES_VAR_SORTED} .idx --> Header, TotalVarBytes, Addresses, Ordinals
- {@code BYTES_VAR_SORTED} .dat --> Header, ({@link DataOutput#writeByte Byte} * variable ValueSize)^NumValues
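For the fixed-width straight formats listed above, a value's position follows directly from the layout (Header, then ValueSize, then one value per document). The following is a minimal sketch of that arithmetic, not Lucene's reader code: the class and method names are hypothetical, the header length is passed in as a parameter because it is not specified here, and a {@code java.nio.ByteBuffer} (big-endian by default) stands in for an in-memory copy of the .dat entry.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class FixedStraightSketch {

  /**
   * Byte offset of a document's value in a fixed-width straight .dat entry:
   * skip the header, skip the Int32 ValueSize field, then skip docId values.
   * headerLength is an assumed input; its actual size depends on the CodecHeader.
   */
  static long offsetOf(int headerLength, int valueSize, int docId) {
    return headerLength + Integer.BYTES + (long) docId * valueSize;
  }

  /** Reads the FIXED_INTS_32 value for docId from an in-memory copy of the .dat entry. */
  static int readInt32(ByteBuffer dat, int headerLength, int docId) {
    // ValueSize is stored immediately after the header.
    int valueSize = dat.getInt(headerLength);
    return dat.getInt((int) offsetOf(headerLength, valueSize, docId));
  }
}
```

The same offset computation would apply to the other FIXED_INTS, FLOAT and BYTES_FIXED_STRAIGHT entries, with only the decoding of the value bytes changing.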
Data Types:
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- PackedType --> {@link DataOutput#writeByte Byte}
- MaxAddress, MinValue, DefaultValue --> {@link DataOutput#writeLong Int64}
- PackedStream, Addresses, Ordinals --> {@link PackedInts}
- ValueSize, NumValues --> {@link DataOutput#writeInt Int32}
- Float32 --> 32-bit float encoded with {@link Float#floatToRawIntBits(float)} then written as {@link DataOutput#writeInt Int32}
- Float64 --> 64-bit float encoded with {@link Double#doubleToRawLongBits(double)} then written as {@link DataOutput#writeLong Int64}
- TotalBytes --> {@link DataOutput#writeVLong VLong}
- TotalVarBytes --> {@link DataOutput#writeLong Int64}
- LengthPrefix --> Length of the data value as {@link DataOutput#writeVInt VInt} (maximum of 2 bytes)
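The Float32 and Float64 encodings above can be illustrated with plain JDK calls. This is a sketch under stated assumptions: the class and method names are hypothetical, and a big-endian {@code java.nio.ByteBuffer} stands in for Lucene's {@code DataOutput}, which is assumed here to write the high-order byte first.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class FloatEncodingSketch {

  /** Float32: raw IEEE 754 bits written as a 4-byte Int32. */
  static byte[] encodeFloat32(float value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(Float.floatToRawIntBits(value)).array();
  }

  static float decodeFloat32(byte[] bytes) {
    return Float.intBitsToFloat(ByteBuffer.wrap(bytes).getInt());
  }

  /** Float64: raw IEEE 754 bits written as an 8-byte Int64. */
  static byte[] encodeFloat64(double value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(Double.doubleToRawLongBits(value)).array();
  }

  static double decodeFloat64(byte[] bytes) {
    return Double.longBitsToDouble(ByteBuffer.wrap(bytes).getLong());
  }
}
```

Because the raw-bits conversions do not canonicalize NaN values, a round trip through these methods preserves the exact bit pattern that was written.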
Notes:
- PackedType is 0 when the stream is compressed, 1 when the stream is written as 64-bit integers.
- Addresses stores pointers to the actual byte location (indexed by docid). In the VAR_STRAIGHT case, each entry can have a different length, so the address for docid+1 is retrieved as well and the length is the difference between the two addresses. A sentinel address is written at the end for the VAR_STRAIGHT case, so the Addresses stream contains maxdoc+1 indices. For the deduplicated VAR_DEREF case, each length is encoded as a prefix to the data itself as a {@link DataOutput#writeVInt VInt} (maximum of 2 bytes).
- Ordinals stores the term ID in sorted order (indexed by docid). In the FIXED_SORTED case, the address into the .dat can be computed from the ordinal as
Header+ValueSize+(ordinal*ValueSize)
because the byte length is fixed. In the VAR_SORTED case, there is double indirection (docid -> ordinal -> address), but an additional sentinel ordinal+address is always written (so there are NumValues+1 ordinals). To determine the length, ord+1's address is looked up as well.
- {@code BYTES_VAR_STRAIGHT}, in contrast to the other straight variants, uses a .idx file to improve lookup performance. In contrast to {@code BYTES_VAR_DEREF}, it doesn't apply deduplication of the document values.
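The lookups described in the notes can be sketched as follows. This is an illustration under assumptions rather than Lucene's implementation: the class and method names are hypothetical, Addresses and Ordinals are modeled as plain arrays instead of {@link PackedInts} readers, addresses are assumed to be relative to the first value byte after the header, and the first ValueSize term in the FIXED_SORTED formula is read as the 4-byte Int32 field that stores ValueSize.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class AddressLookupSketch {

  /**
   * VAR_STRAIGHT: addresses holds maxdoc+1 entries, so the entry for docId+1
   * (the sentinel for the last document) marks where the value ends.
   */
  static byte[] varStraight(byte[] values, long[] addresses, int docId) {
    int start = (int) addresses[docId];
    int end = (int) addresses[docId + 1];
    return Arrays.copyOfRange(values, start, end);
  }

  /**
   * FIXED_SORTED: Header + ValueSize + (ordinal * ValueSize).
   * headerLength is an assumed input; Integer.BYTES accounts for the ValueSize field.
   */
  static long fixedSortedAddress(int headerLength, int valueSize, int ordinal) {
    return headerLength + Integer.BYTES + (long) ordinal * valueSize;
  }

  /**
   * VAR_SORTED: double indirection docid -> ordinal -> address, with the
   * address of ordinal+1 giving the end of the value.
   */
  static byte[] varSorted(byte[] values, long[] addresses, int[] ordinals, int docId) {
    int ord = ordinals[docId];
    return Arrays.copyOfRange(values, (int) addresses[ord], (int) addresses[ord + 1]);
  }
}
```

In this sketch an empty value shows up as two equal consecutive addresses, which is why the sentinel entry is needed to bound the final value.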
@deprecated Only for reading old 4.0 and 4.1 segments