A builder is usually based on a {@linkplain #basename() basename}. Many different collections can be built using the same builder, using {@link #open(CharSequence)} to specify a suffix that will be added to the basename. Creating several collections is a simple way to make collection construction scalable: for instance, {@link Scan} creates several collections, one per batch, and then puts them together using a {@link ConcatenatedDocumentCollection}.
After creating an instance of this class and opening a new collection, it is possible to add new documents incrementally. Each document must be started with {@link #startDocument(CharSequence,CharSequence)} and ended with {@link #endDocument()}; inside each document, each non-text field must be written by passing an object to {@link #nonTextField(Object)}, whereas each text field must be started with {@link #startTextField()} and ended with {@link #endTextField()}: in between, a call to {@link #add(MutableString,MutableString)} must be made for each word/nonword pair retrieved from the original collection. At the end, {@link #close()} returns a {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} that must be serialised.
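The call sequence described above can be illustrated with a minimal, self-contained sketch. It does not use the real builder classes: `SketchBuilder` and `RecordingBuilder` are hypothetical stand-ins that merely mirror the method names mentioned in this documentation (with `CharSequence` in place of `MutableString`, and without a return value from `close()`), and the basename, suffixes, field layout, and sample texts are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class BuilderProtocolSketch {

    /** Hypothetical stand-in for the builder; only the order of calls matters here. */
    interface SketchBuilder {
        void open(CharSequence suffix);
        // Parameter names are assumed; the documentation only gives the types.
        void startDocument(CharSequence title, CharSequence uri);
        void nonTextField(Object o);
        void startTextField();
        void add(CharSequence word, CharSequence nonWord);
        void endTextField();
        void endDocument();
        void close();
    }

    /** Records each call as a string instead of writing a collection. */
    static final class RecordingBuilder implements SketchBuilder {
        final List<String> log = new ArrayList<>();
        private final String basename;

        RecordingBuilder(String basename) { this.basename = basename; }

        public void open(CharSequence suffix) { log.add("open " + basename + suffix); }
        public void startDocument(CharSequence title, CharSequence uri) { log.add("startDocument"); }
        public void nonTextField(Object o) { log.add("nonTextField"); }
        public void startTextField() { log.add("startTextField"); }
        public void add(CharSequence word, CharSequence nonWord) { log.add("add " + word); }
        public void endTextField() { log.add("endTextField"); }
        public void endDocument() { log.add("endDocument"); }
        public void close() { log.add("close"); }
    }

    /** Builds one collection per batch with the same builder, as in the scalable scheme. */
    public static List<String> demo() {
        RecordingBuilder builder = new RecordingBuilder("coll");
        // Two hypothetical batches, each holding the text of one document.
        String[][] batches = { { "hello world" }, { "foo bar baz" } };
        for (int b = 0; b < batches.length; b++) {
            builder.open("-" + b); // the suffix is appended to the basename
            for (String text : batches[b]) {
                builder.startDocument("doc", "urn:example:doc");
                // One non-text field (here, the text length) followed by one text field.
                builder.nonTextField(Integer.valueOf(text.length()));
                builder.startTextField();
                // One add() call per word/nonword pair; the nonword is a single space here.
                for (String word : text.split(" ")) builder.add(word, " ");
                builder.endTextField();
                builder.endDocument();
            }
            builder.close();
        }
        return builder.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Running `main` prints the recorded calls in order, which makes the required nesting visible: `open`, then for each document `startDocument` ... `endDocument`, with every `add` enclosed between `startTextField` and `endTextField`, and finally `close`.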
Several collections (e.g., {@link SimpleCompressedDocumentCollection}, {@link ZipDocumentCollection}) can be exact or approximated: in the latter case, nonwords are not recorded, which decreases space usage.