NOTE: this format is still experimental and subject to change without backwards compatibility.
Basic idea:
In packed blocks, integers are encoded with the same bit width ({@link PackedInts packed format}); the block size (i.e. the number of integers in a block) is fixed (currently 128). Additionally, blocks in which all values are the same are encoded in an optimized way.
In VInt blocks, integers are encoded as {@link DataOutput#writeVInt VInt}: the block size is variable.
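As a rough illustration of why a VInt block has a variable byte size, the sketch below encodes a single integer in the usual VInt layout: seven data bits per byte, low-order group first, with the high bit set on every byte except the last. The class and method names are hypothetical; this is not the code behind {@link DataOutput#writeVInt}, only a minimal sketch under that layout assumption.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
final class VIntSketch {

    /** Encodes value as a VInt: 7 data bits per byte, low-order bits first. */
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // While more than 7 significant bits remain, emit a continuation byte.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVInt(127).length); // prints 1
        System.out.println(encodeVInt(128).length); // prints 2
    }
}
```

Small values such as short document gaps therefore cost a single byte, which is why the tail of a postings list can be stored compactly without padding it to a full packed block.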
When the postings are long enough, Lucene41PostingsFormat will try to encode most integer data as a packed block.
Take a term with 259 documents as an example: the first 256 document ids are encoded as two packed blocks, while the remaining 3 are encoded as one VInt block.
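The arithmetic behind this example can be sketched as follows. The class and method names are hypothetical, and the only fact taken from the format is the fixed packed block size of 128; the sketch assumes that every full group of 128 entries becomes a packed block and the remainder becomes a single VInt block.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class BlockSplitSketch {

    static final int BLOCK_SIZE = 128; // fixed packed block size (currently 128)

    /** Number of full packed blocks for a postings list of the given length. */
    static int packedBlocks(int docFreq) {
        return docFreq / BLOCK_SIZE;
    }

    /** Number of trailing entries left over for the VInt block. */
    static int vIntTail(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        int docFreq = 259;
        // prints "2 packed blocks, 3 entries in the VInt block"
        System.out.println(packedBlocks(docFreq) + " packed blocks, "
                + vIntTail(docFreq) + " entries in the VInt block");
    }
}
```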
Different kinds of data are always encoded separately into different packed blocks, but may be interleaved into the same VInt block.
This strategy is applied to the following tuples: <document number, frequency>, <position, payload length>, <position, offset start, offset length>, and <position, payload length, offset start, offset length>.
The structure of the skip table is quite similar to that of previous versions of Lucene. The skip interval is the same as the block size, and each skip entry points to the beginning of a block. However, skip data is omitted for the first block.
A position is an integer indicating where the term occurs within one document. A payload is a blob of metadata associated with the current position. An offset is a pair of integers indicating the tokenized start/end offsets for the given term at the current position; it is essentially a specialized payload.
When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a null payload contributes one count). As mentioned in block structure, it is possible to encode these three either combined or separately.
In all cases, payloads and offsets are stored together. When encoded as a packed block, position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload metadata is also stored directly in .pay). When encoded as VInt blocks, all three are stored interleaved in the .pos file (as is payload metadata).
With this strategy, the majority of payload and offset data is kept outside the .pos file. So for queries that require only position data, running on a full index with payloads and offsets, this reduces disk pre-fetches.
Files and detailed format:
The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and pointers to the frequencies, positions, payload and skip data in the .doc, .pos, and .pay files. See {@link BlockTreeTermsWriter} for more details on the format.
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections described here:
Notes:
The .tip file contains an index into the term dictionary, so that it can be accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.
The .doc file contains the lists of documents which contain each term, along with the frequency of the term in that document (except when frequencies are omitted: {@link IndexOptions#DOCS_ONLY}). It also stores skip data pointing to the beginning of each packed or VInt block when the document list is longer than the packed block size.
Notes:
DocDelta: if frequencies are indexed, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (the previous document number is taken to be zero for the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If frequencies are omitted, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with frequencies indexed, would be the following sequence of VInts:
15, 8, 3
If frequencies were omitted ({@link IndexOptions#DOCS_ONLY}) it would be this sequence of VInts instead:
7,4
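The two sequences above can be reproduced with a short sketch of the DocDelta rule. The class and method names are hypothetical, and the sketch only produces the logical sequence of integers that would each be written as a VInt; it does not write bytes and is not the actual postings writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class DocDeltaSketch {

    /**
     * Returns the integers that would be written as VInts for the given
     * ascending document ids and their frequencies.
     */
    static List<Integer> encode(int[] docIds, int[] freqs, boolean indexFreqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0; // the first gap is measured from zero
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (!indexFreqs) {
                out.add(gap);            // frequencies omitted: plain gap
            } else if (freqs[i] == 1) {
                out.add((gap << 1) | 1); // odd DocDelta means frequency one
            } else {
                out.add(gap << 1);       // even DocDelta ...
                out.add(freqs[i]);       // ... followed by the frequency
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docIds = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(encode(docIds, freqs, true));  // prints [15, 8, 3]
        System.out.println(encode(docIds, freqs, false)); // prints [7, 4]
    }
}
```

Folding the common frequency-one case into the low bit of DocDelta saves one VInt for every document in which the term occurs exactly once.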
The .pos file contains the lists of positions at which each term occurs within documents. It sometimes also stores some of the payloads and offsets, for speed.
Notes:
4, 5, 4
The .pay file stores payloads and offsets associated with certain term-document positions. Some payloads and offsets are instead stored in the .pos file, for performance reasons.
Notes: