Pasting is a very slow way of combining indices: we assume that not only documents, but also document occurrences might be scattered throughout several indices. When a document appears in several indices, its occurrences in a given index are combined. We have two possibilities, standard pasting and incremental pasting.
Standard pasting is used, for instance, to paste the batches of a {@linkplain it.unimi.dsi.mg4j.document.DocumentFactory.FieldType#VIRTUAL virtual field} generated by {@link Scan}; the latter takes care of numbering positions correctly. If, however, you index parts of the same document collection on different machines using the same {@link VirtualDocumentResolver}, then the resulting indices for virtual fields will have all positions starting from zero, and they will need incremental pasting to be combined correctly.
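The difference between the two kinds of pasting can be illustrated on the position lists of a single term in a single document. The following is a minimal sketch, not MG4J code: the class name, the method and the array layout are hypothetical, and it assumes that incremental pasting shifts the positions coming from each index by the total size of the document in the preceding indices.

```java
import java.util.Arrays;

/** Toy illustration only; names and data layout are hypothetical, not part of MG4J. */
class PastePositionsSketch {
    /**
     * Pastes the position lists of one term in one document.
     *
     * @param positions positions[i] are the positions recorded by the i-th index.
     * @param sizes sizes[i] is the size of the document in the i-th index (used only if incremental is true).
     * @param incremental whether positions must be renumbered.
     */
    public static int[] paste(int[][] positions, int[] sizes, boolean incremental) {
        int total = 0;
        for (int[] p : positions) total += p.length;
        int[] result = new int[total];
        int offset = 0; // Assumed shift: sum of the sizes of the document in preceding indices
        int k = 0;
        for (int i = 0; i < positions.length; i++) {
            for (int pos : positions[i]) result[k++] = incremental ? pos + offset : pos;
            if (incremental) offset += sizes[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Standard pasting: positions were already numbered correctly by the indexer.
        System.out.println(Arrays.toString(paste(new int[][] {{0, 3}, {6}}, null, false)));
        // Incremental pasting: both indices start from zero; the first one saw 5 words.
        System.out.println(Arrays.toString(paste(new int[][] {{0, 3}, {1}}, new int[] {5, 4}, true)));
    }
}
```

Both calls print `[0, 3, 6]`: in the first case the lists are simply concatenated, in the second the position 1 of the second index is shifted by the size 5 of the first segment.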
Conceptually, this operation is equivalent to splitting a collection vertically: each document is divided into a fixed number n of consecutive segments (possibly of length 0), and n indices are created, the k-th of which is built from the k-th segment of every document. Pasting the resulting indices will produce an index that is identical to the index generated by the original collection. The behaviour is analogous to that of the UN*X paste command if documents are single-line lists of words.
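The vertical-splitting view can be checked on a toy collection. The sketch below is an illustration under stated assumptions, not the MG4J implementation: it uses plain in-memory maps (term, then document number, then positions) instead of real indices, all names are hypothetical, and each segment is indexed with positions starting from zero, so the pasting step shifts positions by the sizes of the preceding segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration only; names and data structures are hypothetical, not part of MG4J. */
class VerticalSplitSketch {
    /** Builds a toy positional index: term -> (document number -> positions). */
    public static Map<String, Map<Integer, List<Integer>>> index(String[][] docs) {
        Map<String, Map<Integer, List<Integer>>> idx = new TreeMap<>();
        for (int d = 0; d < docs.length; d++)
            for (int p = 0; p < docs[d].length; p++)
                idx.computeIfAbsent(docs[d][p], t -> new TreeMap<>())
                   .computeIfAbsent(d, x -> new ArrayList<>())
                   .add(p);
        return idx;
    }

    /** Returns the k-th of n consecutive segments of each document (segments may be empty). */
    public static String[][] segment(String[][] docs, int n, int k) {
        String[][] seg = new String[docs.length][];
        for (int d = 0; d < docs.length; d++) {
            int len = docs[d].length;
            seg[d] = Arrays.copyOfRange(docs[d], len * k / n, len * (k + 1) / n);
        }
        return seg;
    }

    /** Pastes the segment indices, shifting positions by the sizes of the preceding segments. */
    public static Map<String, Map<Integer, List<Integer>>> paste(
            List<Map<String, Map<Integer, List<Integer>>>> indices, int[][] sizes, int numDocs) {
        Map<String, Map<Integer, List<Integer>>> out = new TreeMap<>();
        int[] offset = new int[numDocs]; // Per-document shift accumulated so far
        for (int i = 0; i < indices.size(); i++) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> term : indices.get(i).entrySet())
                for (Map.Entry<Integer, List<Integer>> posting : term.getValue().entrySet()) {
                    List<Integer> target = out.computeIfAbsent(term.getKey(), t -> new TreeMap<>())
                                              .computeIfAbsent(posting.getKey(), x -> new ArrayList<>());
                    for (int p : posting.getValue()) target.add(p + offset[posting.getKey()]);
                }
            for (int d = 0; d < numDocs; d++) offset[d] += sizes[i][d];
        }
        return out;
    }

    /** Splits the collection into n vertical slices, indexes each slice and pastes the results. */
    public static Map<String, Map<Integer, List<Integer>>> splitAndPaste(String[][] docs, int n) {
        List<Map<String, Map<Integer, List<Integer>>>> indices = new ArrayList<>();
        int[][] sizes = new int[n][docs.length];
        for (int k = 0; k < n; k++) {
            String[][] seg = segment(docs, n, k);
            for (int d = 0; d < docs.length; d++) sizes[k][d] = seg[d].length;
            indices.add(index(seg));
        }
        return paste(indices, sizes, docs.length);
    }

    public static void main(String[] args) {
        String[][] docs = { "a b a c b".split(" "), { "c" }, {} };
        // The pasted index coincides with the index of the original collection.
        System.out.println(splitAndPaste(docs, 3).equals(index(docs))); // prints true
    }
}
```

Note that the third document is empty and the second one has a single word, so some segments have length 0, as allowed by the description above.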
Note that if every document appears in at most one index, pasting is equivalent to {@linkplain it.unimi.dsi.mg4j.tool.Merge merging}. It is, however, significantly slower, as the presence of the same document in several lists makes it necessary to scan the inverted lists to be pasted completely to compute the frequency. To do so, an in-memory buffer is allocated. If an inverted list does not fit in the memory buffer, it is spilled to disk. Sizing the buffer correctly and choosing a fast file system for the temporary directory can significantly affect performance.
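To see why the frequency cannot be known without a complete scan, consider the document pointers of one term in the indices to be pasted. When the lists are disjoint, the resulting frequency is just the sum of the frequencies; when they overlap, a document appearing in several lists must be counted once. The following sketch is a hypothetical illustration (plain sorted arrays, invented names), not the code used by this class.

```java
/** Toy illustration only; names and data layout are hypothetical, not part of MG4J. */
class PastedFrequencySketch {
    /**
     * Counts the distinct document pointers appearing in the inverted lists of one term.
     * Each list is assumed to be strictly increasing.
     */
    public static int pastedFrequency(int[][] lists) {
        int[] next = new int[lists.length]; // Cursor into each list
        int frequency = 0;
        for (;;) {
            // Find the smallest document pointer not yet consumed.
            int min = Integer.MAX_VALUE;
            boolean any = false;
            for (int i = 0; i < lists.length; i++)
                if (next[i] < lists[i].length) {
                    any = true;
                    min = Math.min(min, lists[i][next[i]]);
                }
            if (!any) return frequency;
            // Advance every list positioned on that document: it is counted once.
            for (int i = 0; i < lists.length; i++)
                if (next[i] < lists[i].length && lists[i][next[i]] == min) next[i]++;
            frequency++;
        }
    }

    public static void main(String[] args) {
        // Documents 2 and 5 appear in two lists each: six postings, four documents.
        System.out.println(pastedFrequency(new int[][] {{0, 2, 5}, {2, 3}, {5}}));
    }
}
```

Since the frequency is typically stored before the postings it describes, the pasted list has to be assembled somewhere before it can be written, which is consistent with the buffer described above.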
Warning: incremental pasting is very memory-intensive, as a list of sizes must be loaded for each index. You can use the URI option succinctsizes=1 to load sizes in a succinct format, which will ease the problem.

@author Sebastiano Vigna
@since 1.0