The DocValues metadata or .dvm file.
For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).
DocValues metadata (.dvm) --> Header,<Entry>^NumFields
- Entry --> NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
- NumericEntry --> GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
- GCDNumericEntry --> NumericHeader,MinValue,GCD
- TableNumericEntry --> NumericHeader,TableSize,{@link DataOutput#writeLong Int64}^TableSize
- DeltaNumericEntry --> NumericHeader
- NumericHeader --> FieldNumber,EntryType,NumericType,MissingOffset,PackedVersion,DataOffset,Count,BlockSize
- BinaryEntry --> FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
- FixedBinaryEntry --> BinaryHeader
- VariableBinaryEntry --> BinaryHeader,AddressOffset,PackedVersion,BlockSize
- PrefixBinaryEntry --> BinaryHeader,AddressInterval,AddressOffset,PackedVersion,BlockSize
- BinaryHeader --> FieldNumber,EntryType,BinaryType,MissingOffset,MinLength,MaxLength,DataOffset
- SortedEntry --> FieldNumber,EntryType,BinaryEntry,NumericEntry
- SortedSetEntry --> EntryType,BinaryEntry,NumericEntry,NumericEntry
- FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> {@link DataOutput#writeVInt VInt}
- EntryType,CompressionType --> {@link DataOutput#writeByte Byte}
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- MinValue,GCD,MissingOffset,AddressOffset,DataOffset --> {@link DataOutput#writeLong Int64}
- TableSize --> {@link DataOutput#writeVInt VInt}
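The VInt fields above use a variable-length encoding: seven data bits per byte, low-order group first, with the high bit of each byte signalling that another byte follows. The following is a minimal sketch of that encoding, not Lucene's own `DataOutput`/`DataInput` code; the class name `VIntSketch` and its method signatures are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative VInt codec (hypothetical helper, not a Lucene class). */
final class VIntSketch {

    /** Encodes an int as 1 to 5 bytes, 7 data bits per byte, low-order group first. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;                     // unsigned shift, so negatives terminate
        }
        out.write(value);                     // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes the VInt that starts at position pos. */
    static int readVInt(byte[] bytes, int pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

Under this encoding small values such as a typical FieldNumber occupy a single byte, while a negative value always occupies five bytes.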
Sorted fields have two entries: a BinaryEntry with the value metadata, and an ordinary NumericEntry for the document-to-ord metadata.
SortedSet fields have three entries: a BinaryEntry with the value metadata, and two NumericEntries for the document-to-ord-index and ordinal list metadata.
FieldNumber of -1 indicates the end of metadata.
EntryType is 0 (NumericEntry) or 1 (BinaryEntry).
DataOffset is the pointer to the start of the data in the DocValues data (.dvd).
NumericType indicates how Numeric values will be compressed:
- 0 --> delta-compressed. For each block of 16k integers, every integer is delta-encoded from the minimum value within the block.
- 1 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.
- 2 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
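As an illustration of the gcd-compressed case, the sketch below derives MinValue and GCD from a set of values, keeps only the quotients, and rebuilds each value as `MinValue + GCD * quotient`. It is a simplified model under stated assumptions: the class `GcdNumericSketch` is hypothetical, the quotients are held in a plain `long[]` rather than in block-packed form, and `value - min` is assumed not to overflow.

```java
import java.util.Arrays;

/** Illustrative GCD compression of per-document longs (hypothetical, not Lucene code). */
final class GcdNumericSketch {
    final long minValue;    // corresponds to MinValue in GCDNumericEntry
    final long gcd;         // corresponds to GCD in GCDNumericEntry
    final long[] quotients; // stand-in for the packed data in the .dvd file

    GcdNumericSketch(long[] values) {
        long min = Arrays.stream(values).min().orElse(0L);
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min); // assumes v - min does not overflow
        }
        if (g == 0) {
            g = 1; // all values equal: any divisor works, avoid dividing by zero
        }
        long[] q = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            q[i] = (values[i] - min) / g;
        }
        this.minValue = min;
        this.gcd = g;
        this.quotients = q;
    }

    /** Reconstructs the original value for a document. */
    long get(int docID) {
        return minValue + gcd * quotients[docID];
    }

    /** Euclid's algorithm on non-negative inputs. */
    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }
}
```

For the values 1000, 4000, 2500 and 7000 this yields a minimum of 1000, a divisor of 1500 and the quotients 0, 2, 1 and 4, which need far fewer bits than the original values.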
BinaryType indicates how Binary values will be stored:
- 0 --> fixed-width. All values have the same length, addressing by multiplication.
- 1 --> variable-width. An address for each value is stored.
- 2 --> prefix-compressed. An address to the start of every interval'th value is stored.
MinLength and MaxLength represent the min and max byte[] value lengths for Binary values. If they are equal, then all values are of a fixed size, and can be addressed as DataOffset + (docID * length). Otherwise, the binary values are of variable size, and packed integer metadata (PackedVersion,BlockSize) is written for the addresses.
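The two addressing schemes can be sketched as follows. The fixed-width formula comes directly from the text above. The variable-width part is an assumption made for illustration: it supposes that the stored address for value `i` is the end offset of that value relative to DataOffset, held here in a plain `long[]`. The class `BinaryAddressingSketch` and its methods are hypothetical.

```java
/** Illustrative address arithmetic for binary values (hypothetical, not Lucene code). */
final class BinaryAddressingSketch {

    /** Fixed-width: every value has the same length, so the start is a multiplication. */
    static long fixedStart(long dataOffset, int docID, int length) {
        return dataOffset + (long) docID * length; // widen before multiplying
    }

    /**
     * Variable-width, assuming endAddresses[i] is the end offset of value i
     * relative to dataOffset. Value 0 then starts at dataOffset itself.
     */
    static long variableStart(long dataOffset, long[] endAddresses, int docID) {
        return dataOffset + (docID == 0 ? 0L : endAddresses[docID - 1]);
    }

    /** Length of a variable-width value under the same assumption. */
    static int variableLength(long[] endAddresses, int docID) {
        long start = docID == 0 ? 0L : endAddresses[docID - 1];
        return (int) (endAddresses[docID] - start);
    }
}
```

Because the addresses in such a layout never decrease, they suit a monotonic packed-integer encoding, which is why only PackedVersion and BlockSize need to be recorded for them.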
MissingOffset points to a byte[] containing a bitset of all documents that had a value for the field. If it is -1, then there are no missing values.
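A reader could test such a bitset roughly as shown below. This is a sketch only: the bit order (bit `docID & 7` of byte `docID >> 3`) is an assumption, not something the format description states, and the class `MissingBitsSketch` is hypothetical.

```java
/** Illustrative "documents with a value" bitset (hypothetical, not Lucene code). */
final class MissingBitsSketch {

    /** Builds a bitset with one bit per document, set for documents that have a value. */
    static byte[] build(int maxDoc, int... docsWithValue) {
        byte[] bits = new byte[(maxDoc + 7) >> 3]; // round up to whole bytes
        for (int docID : docsWithValue) {
            bits[docID >> 3] |= 1 << (docID & 7);  // assumed bit order: low bit first
        }
        return bits;
    }

    /** Returns true if the document has a value, false if it is missing. */
    static boolean hasValue(byte[] bits, int docID) {
        return (bits[docID >> 3] & (1 << (docID & 7))) != 0;
    }
}
```

When MissingOffset is -1 no bitset is consulted at all, and every document is treated as having a value.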
The DocValues data or .dvd file.
For each DocValues field, this stores the actual per-document data (the heavy lifting).
DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields
- NumericData --> DeltaCompressedNumerics | TableCompressedNumerics | GCDCompressedNumerics
- BinaryData --> {@link DataOutput#writeByte Byte}^DataLength,Addresses
- SortedData --> {@link FST FST<Int64>}
- DeltaCompressedNumerics --> {@link BlockPackedWriter BlockPackedInts(blockSize=16k)}
- TableCompressedNumerics --> {@link PackedInts PackedInts}
- GCDCompressedNumerics --> {@link BlockPackedWriter BlockPackedInts(blockSize=16k)}
- Addresses --> {@link MonotonicBlockPackedWriter MonotonicBlockPackedInts(blockSize=16k)}
SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing {@link DataOutput#writeVLong vLong}s, delta-encoded.
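The delta encoding of one document's ordinal list can be sketched as below. The sketch assumes the first ordinal is written as a delta from zero, which the description does not spell out, and it uses a hypothetical class `OrdinalListSketch` with its own vLong routines rather than Lucene's `DataOutput`.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Illustrative delta + vLong coding of an increasing ordinal list (not Lucene code). */
final class OrdinalListSketch {

    /** Writes each ordinal as the vLong difference from the previous one. */
    static byte[] encode(long[] increasingOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0; // assumption: the first delta is relative to zero
        for (long ord : increasingOrds) {
            writeVLong(out, ord - previous);
            previous = ord;
        }
        return out.toByteArray();
    }

    /** Reads vLong deltas until the buffer is exhausted and sums them back up. */
    static long[] decode(byte[] bytes) {
        List<Long> ords = new ArrayList<>();
        long current = 0;
        int pos = 0;
        while (pos < bytes.length) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            current += delta;
            ords.add(current);
        }
        long[] result = new long[ords.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ords.get(i);
        }
        return result;
    }

    /** Seven data bits per byte, low-order group first; input must be non-negative. */
    private static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }
}
```

Since the ordinals are increasing, every delta is small and non-negative, so most of them fit in a single byte.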