This class indexes a document collection (skipping the direct file construction). It implements a single-pass algorithm that operates in two phases:
First, it traverses the document collection, passes the terms through the TermPipeline, and builds an in-memory representation of the posting lists. When main memory is exhausted, it flushes the sorted postings to disk along with the lexicon (collectively known as a run), and continues traversing the collection.
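The first phase can be sketched as follows. This is a minimal illustration, not this class's actual code: the names RunBuilder, indexDocument, flush and getRuns are hypothetical, a fixed document count stands in for the memory check, and the runs are kept in a list rather than written to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of phase one: build postings in memory, commit sorted runs. */
class RunBuilder {
    /** term -> (docid -> term frequency); a TreeMap keeps the terms sorted. */
    private TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();
    /** Committed runs; a real indexer would write each run to disk instead. */
    private final List<TreeMap<String, TreeMap<Integer, Integer>>> runs = new ArrayList<>();
    /** Assumption: a simple document limit replaces the memory check. */
    private final int maxDocsPerRun;
    private int docsInRun = 0;
    private int nextDocId = 0;

    RunBuilder(int maxDocsPerRun) {
        this.maxDocsPerRun = maxDocsPerRun;
    }

    /** Indexes one document whose terms have already passed through the term pipeline. */
    void indexDocument(List<String> terms) {
        int docId = nextDocId++;
        for (String term : terms) {
            // Create the posting list on first sight of the term, then count the occurrence.
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
        docsInRun++;
        if (docsInRun >= maxDocsPerRun) {
            flush();
        }
    }

    /** Commits the current in-memory postings as a run and starts a fresh one. */
    void flush() {
        if (!postings.isEmpty()) {
            runs.add(postings);
            postings = new TreeMap<>();
        }
        docsInRun = 0;
    }

    List<TreeMap<String, TreeMap<Integer, Integer>>> getRuns() {
        return runs;
    }
}
```

A caller would invoke indexDocument once per document and flush once at the end of the collection, so that the last partial run is also committed.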
The second phase merges the sorted runs (with their partial lexicons) on disk to create the final inverted file. This class follows the template method pattern, so the bulk of the code is reused for block (and fields) indexing. A few hook methods choose the right classes to instantiate, depending on the indexing options defined.
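The merge performed in the second phase can be illustrated with a k-way merge over sorted runs. The sketch below is an assumption-laden simplification: RunMerger and Cursor are hypothetical names, each run is an in-memory map from term to ascending docids rather than a file on disk, and runs are assumed to be supplied in the order they were flushed, so later runs hold larger docids.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical sketch of phase two: a k-way merge of sorted runs. */
class RunMerger {
    /** Position within one run: the current entry and the iterator behind it. */
    private static final class Cursor {
        final int runNo;
        final Iterator<Map.Entry<String, List<Integer>>> it;
        Map.Entry<String, List<Integer>> head;

        Cursor(int runNo, Iterator<Map.Entry<String, List<Integer>>> it) {
            this.runNo = runNo;
            this.it = it;
            this.head = it.next();
        }

        /** Moves to the next term of the run; returns false when the run is exhausted. */
        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            head = it.next();
            return true;
        }
    }

    /** Merges the runs into one term-ordered map of concatenated posting lists. */
    static Map<String, List<Integer>> merge(List<TreeMap<String, List<Integer>>> runs) {
        // Order cursors by current term; break ties by run number so docids stay ascending.
        PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> {
            int c = a.head.getKey().compareTo(b.head.getKey());
            return c != 0 ? c : Integer.compare(a.runNo, b.runNo);
        });
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                queue.add(new Cursor(i, runs.get(i).entrySet().iterator()));
            }
        }
        // Insertion order equals term order because the queue always yields the smallest term.
        Map<String, List<Integer>> merged = new LinkedHashMap<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            merged.computeIfAbsent(c.head.getKey(), t -> new ArrayList<>())
                  .addAll(c.head.getValue());
            if (c.advance()) {
                queue.add(c); // re-insert so the cursor is ordered by its new head term
            }
        }
        return merged;
    }
}
```

Only one entry per run is held in the queue at any time, which is what lets a merge of this kind stream from disk instead of loading whole runs into memory.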
Memory tracking is a key concern in this class. The properties below control how the amount of memory consumed is checked, how regularly it is checked, and (optionally) the maximum amount of memory that the postings can use or the maximum number of documents indexed before a flush is committed.
Properties:
- memory.reserved - free-memory threshold below which a run is committed. Default is 50 000 000 (50MB) for 32-bit JVMs and 100 000 000 (100MB) for 64-bit JVMs.
- memory.heap.usage - proportion of the maximum heap that can be allocated to the JVM before a run is committed. Default is 0.70.
- indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed.
- indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed.
- docs.check - number of documents indexed between checks of the amount of free memory. Defaults to 20.
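How these properties might combine into a flush decision is sketched below. This is not the class's actual logic: FlushPolicy and shouldFlush are hypothetical names, java.util.Properties stands in for the real configuration mechanism, treating 0 as "no limit" for the two optional maximums is an assumption, and so is requiring both memory conditions to hold at once. The 32-bit default is used for memory.reserved.

```java
import java.util.Properties;

/** Hypothetical sketch of a flush policy driven by the properties listed above. */
class FlushPolicy {
    final long reservedBytes;     // memory.reserved
    final double heapUsage;       // memory.heap.usage
    final long maxPostingsBytes;  // indexing.singlepass.max.postings.memory (0 = no limit, assumed)
    final int maxDocsPerFlush;    // indexing.singlepass.max.documents.flush (0 = no limit, assumed)
    final int docsPerCheck;       // docs.check

    FlushPolicy(Properties p) {
        // Documented defaults; 50 000 000 is the 32-bit value (100 000 000 on 64-bit JVMs).
        reservedBytes = Long.parseLong(p.getProperty("memory.reserved", "50000000"));
        heapUsage = Double.parseDouble(p.getProperty("memory.heap.usage", "0.70"));
        maxPostingsBytes = Long.parseLong(
                p.getProperty("indexing.singlepass.max.postings.memory", "0"));
        maxDocsPerFlush = Integer.parseInt(
                p.getProperty("indexing.singlepass.max.documents.flush", "0"));
        // Guard against a zero interval, which would otherwise divide by zero below.
        docsPerCheck = Math.max(1, Integer.parseInt(p.getProperty("docs.check", "20")));
    }

    /** Pure decision function, so the rule can be exercised without a real heap. */
    boolean shouldFlush(int docsInRun, long postingsBytes,
                        long freeBytes, long totalBytes, long maxBytes) {
        if (maxDocsPerFlush > 0 && docsInRun >= maxDocsPerFlush) {
            return true;
        }
        if (maxPostingsBytes > 0 && postingsBytes >= maxPostingsBytes) {
            return true;
        }
        if (docsInRun % docsPerCheck != 0) {
            return false; // memory is only sampled every docs.check documents
        }
        boolean lowOnFreeMemory = freeBytes < reservedBytes;
        boolean heapMostlyAllocated = (double) totalBytes / maxBytes > heapUsage;
        // Assumption: both conditions must hold before a run is committed.
        return lowOnFreeMemory && heapMostlyAllocated;
    }

    /** Convenience overload that reads the live JVM figures. */
    boolean shouldFlush(int docsInRun, long postingsBytes) {
        Runtime rt = Runtime.getRuntime();
        return shouldFlush(docsInRun, postingsBytes,
                rt.freeMemory(), rt.totalMemory(), rt.maxMemory());
    }
}
```

Separating the decision from the Runtime calls keeps the rule easy to test with fixed numbers, while the two-argument overload shows where the live heap figures would come from.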
@author Roi Blanco