Examples of net.bpiwowar.mg4j.extensions.trec.TRECDocumentCollection

net.bpiwowar.mg4j.extensions.trec.TRECDocumentCollection

A collection for the TREC data set.

The documents are stored as a set of descriptors, each recording the (possibly gzipped) file that contains a document and the start and stop positions within that file. The descriptors are later resolved through {@link SegmentedInputStream}.
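The descriptor idea can be illustrated with a minimal, self-contained sketch. It is not the collection's actual code: the `DescriptorSketch` class, its `Descriptor` type and the `readSegment` helper are hypothetical names introduced here, and the container "file" is an uncompressed in-memory byte array rather than a possibly gzipped file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DescriptorSketch {
    /** Hypothetical descriptor: the container file and the byte range [start, stop) of one document. */
    public static final class Descriptor {
        public final int fileIndex;
        public final long start;
        public final long stop;

        public Descriptor(int fileIndex, long start, long stop) {
            this.fileIndex = fileIndex;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Reads the bytes in [start, stop) from a stream positioned at the beginning of the container file. */
    public static byte[] readSegment(InputStream in, Descriptor d) {
        try {
            // Advance to the start offset; skip() may skip fewer bytes than requested.
            long remaining = d.start;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped > 0) {
                    remaining -= skipped;
                } else if (in.read() >= 0) {
                    remaining--;
                } else {
                    throw new EOFException("Start offset lies beyond the end of the stream");
                }
            }
            // Read exactly stop - start bytes; readFully throws EOFException if the stream is too short.
            byte[] buffer = new byte[(int) (d.stop - d.start)];
            new DataInputStream(in).readFully(buffer);
            return buffer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = "<DOC>first</DOC>\n<DOC>second</DOC>\n".getBytes(StandardCharsets.US_ASCII);
        // The second document occupies bytes 17 (inclusive) to 34 (exclusive) of the container.
        Descriptor second = new Descriptor(0, 17, 34);
        byte[] segment = readSegment(new ByteArrayInputStream(file), second);
        System.out.println(new String(segment, StandardCharsets.US_ASCII));
    }
}
```

Keeping only offsets means the collection never has to hold document contents in memory; a document is materialized by opening its container file and reading the recorded range.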

To interpret a file, we read up to <DOC> and place a start marker there, then advance to the header and store the URI. An intermediate marker is placed at the end of the document header tag, and a stop marker just before </DOC>.
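The marker placement described above can be sketched as follows. This is an illustration under stated assumptions, not the collection's parser: it assumes the header is delimited by <DOCHDR> and </DOCHDR> and that the URI is the first whitespace-delimited token inside it (the layout used by TREC web collections), it works on character offsets in a String rather than byte offsets in a file, and the `TrecMarkerSketch` and `Markers` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TrecMarkerSketch {
    private static final String DOC_OPEN = "<DOC>";
    private static final String DOC_CLOSE = "</DOC>";
    // Assumption: the document header uses the DOCHDR tag.
    private static final String HDR_OPEN = "<DOCHDR>";
    private static final String HDR_CLOSE = "</DOCHDR>";

    /** Hypothetical holder for the three markers of one document and its URI. */
    public static final class Markers {
        public final int start;      // offset of "<DOC>"
        public final int headerEnd;  // offset just past "</DOCHDR>"
        public final int stop;       // offset of "</DOC>"
        public final String uri;

        Markers(int start, int headerEnd, int stop, String uri) {
            this.start = start;
            this.headerEnd = headerEnd;
            this.stop = stop;
            this.uri = uri;
        }
    }

    /** Scans the text and returns one set of markers per complete document that has a header. */
    public static List<Markers> scan(String text) {
        List<Markers> result = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(DOC_OPEN, from);
            if (start < 0) break;
            int stop = text.indexOf(DOC_CLOSE, start);
            if (stop < 0) break; // truncated last document: ignored in this sketch
            int hdrOpen = text.indexOf(HDR_OPEN, start);
            int hdrClose = text.indexOf(HDR_CLOSE, start);
            // Documents without a well-formed header inside [start, stop) are skipped.
            if (hdrOpen >= 0 && hdrOpen < hdrClose && hdrClose < stop) {
                String header = text.substring(hdrOpen + HDR_OPEN.length(), hdrClose).trim();
                String uri = header.isEmpty() ? "" : header.split("\\s+")[0];
                result.add(new Markers(start, hdrClose + HDR_CLOSE.length(), stop, uri));
            }
            from = stop + DOC_CLOSE.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<DOC>\n<DOCNO>D1</DOCNO>\n<DOCHDR>\nhttp://example.org/a 1.2.3.4\n</DOCHDR>\nhello\n</DOC>\n";
        for (Markers m : scan(text)) {
            System.out.println(m.uri + " [" + m.start + ", " + m.headerEnd + ", " + m.stop + "]");
        }
    }
}
```

With the three offsets recorded, the header is the range from the start marker to the intermediate marker, and the document body is the range from the intermediate marker to the stop marker.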

The collection provides both sequential access to all documents via the iterator and random access to a given document. The two operations are implemented very differently, however: sequential access is much faster than calling {@link #document(long)} repeatedly.

@author Alessio Orlandi
@author Luca Natali
@author Benjamin Piwowarski

case "trec":
    // Configure the document factory to decode TREC documents as UTF-8.
    Properties properties = new Properties();
    properties.setProperty(PropertyBasedDocumentFactory.MetadataKeys.ENCODING, "UTF-8");
    final TRECDocumentFactory documentFactory = new TRECDocumentFactory(properties);

    // Build the collection over the input files with the default buffer size.
    collection = new TRECDocumentCollection(files,
            documentFactory, SegmentedDocumentCollection.DEFAULT_BUFFER_SIZE, compression, metadataFile);
    break;

case "warc/0.18":
    collection = new WARCDocumentCollection(files, SegmentedDocumentCollection.DEFAULT_BUFFER_SIZE, compression, metadataFile);
