A collection for the TREC data set.
The documents are stored as a set of descriptors, each recording the (possibly gzipped) file that contains the document and the start and stop positions within that file. To process the descriptors later we rely on {@link SegmentedInputStream}.
To interpret a file, we read up to {@code <DOC>} and place a start marker there, then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before {@code </DOC>}.
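The marker placement described above can be illustrated with a minimal, self-contained sketch. It is not this class's actual parser. The class name {@code TrecMarkerSketch}, the {@code Descriptor} type, and the {@code <DOCHDR>} header tag name are assumptions made for illustration, as is taking the URI to be the first token inside the header. The sketch also works on an in-memory, already decompressed byte array with {@code int} offsets, whereas the real collection deals with possibly gzipped files.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of marker placement; not the collection's actual parser. */
class TrecMarkerSketch {

    /** Offsets of one document inside a single, already decompressed file. */
    static final class Descriptor {
        final String uri;
        final int start;        // offset of "<DOC>"
        final int intermediate; // offset just past the (assumed) "</DOCHDR>" tag
        final int stop;         // offset of "</DOC>"

        Descriptor(String uri, int start, int intermediate, int stop) {
            this.uri = uri;
            this.start = start;
            this.intermediate = intermediate;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return uri + " [" + start + ", " + intermediate + ", " + stop + "]";
        }
    }

    private static final byte[] DOC_OPEN = ascii("<DOC>");
    private static final byte[] DOC_CLOSE = ascii("</DOC>");
    private static final byte[] HDR_OPEN = ascii("<DOCHDR>");   // assumed tag name
    private static final byte[] HDR_CLOSE = ascii("</DOCHDR>"); // assumed tag name

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the first index >= from at which pattern occurs in data, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Scans the data once and returns one descriptor per complete document. */
    static List<Descriptor> scan(byte[] data) {
        List<Descriptor> result = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, DOC_OPEN, pos);            // start marker
            if (start < 0) break;
            int hdrOpen = indexOf(data, HDR_OPEN, start);
            int hdrClose = hdrOpen < 0 ? -1 : indexOf(data, HDR_CLOSE, hdrOpen);
            int stop = hdrClose < 0 ? -1 : indexOf(data, DOC_CLOSE, hdrClose);
            if (stop < 0) break;                                 // truncated record
            int hdrBody = hdrOpen + HDR_OPEN.length;
            String header = new String(data, hdrBody, hdrClose - hdrBody,
                    StandardCharsets.ISO_8859_1).trim();
            String uri = header.split("\\s+", 2)[0];             // first header token
            int intermediate = hdrClose + HDR_CLOSE.length;      // intermediate marker
            result.add(new Descriptor(uri, start, intermediate, stop));
            pos = stop + DOC_CLOSE.length;
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<DOC>\n<DOCNO>D1</DOCNO>\n<DOCHDR>\nhttp://example.org/a\n"
                + "</DOCHDR>\nbody one\n</DOC>\n";
        for (Descriptor d : scan(sample.getBytes(StandardCharsets.ISO_8859_1))) {
            System.out.println(d);
        }
    }
}
```

In this sketch, the bytes between {@code intermediate} and {@code stop} are the document content, and the bytes between {@code start} and {@code intermediate} hold the metadata. Splitting one file into such ranges is the kind of job a segmented stream is suited to.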
The collection provides both sequential access to all documents via the iterator and random access to a given document. The two are implemented very differently, however: sequential iteration is much faster than calling {@link #document(long)} repeatedly.
@author Alessio Orlandi
@author Luca Natali
@author Benjamin Piwowarski