Examples of org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter

org.apache.lucene.codecs.BlockTreeTermsWriter
Block-based terms index and dictionary writer.
Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.
Files:
- .tim: Term Dictionary
- .tip: Term Index
Term Dictionary

The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

The .tim file is arranged in blocks. Each block contains a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.
- TermsDict (.tim) --> Header, Postings Metadata, Block^NumBlocks, FieldSummary, DirOffset
- Block --> SuffixBlock, StatsBlock, MetadataBlock
- SuffixBlock --> EntryCount, SuffixLength, Byte^SuffixLength
- StatsBlock --> StatsLength, <DocFreq, TotalTermFreq>^EntryCount
- MetadataBlock --> MetaLength, <Term Metadata>^EntryCount
- FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^{RootCodeLength}, SumDocFreq, DocCount>^NumFields
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- DirOffset --> {@link DataOutput#writeLong Uint64}
- EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> {@link DataOutput#writeVInt VInt}
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> {@link DataOutput#writeVLong VLong}
Notes:
- Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information for the BlockTree implementation.
- DirOffset is a pointer to the FieldSummary section.
- DocFreq is the count of documents which contain the term.
- TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
- FieldNumber is the field's number from {@link FieldInfos}. (.fnm)
- NumTerms is the number of unique terms for the field.
- RootCode points to the root block for the field.
- SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
- DocCount is the number of documents that have at least one posting for this field.
- PostingsMetadata and TermMetadata are plugged into by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
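
The stats encoding described in the notes above can be illustrated with a small, self-contained sketch. It does not use Lucene's DataOutput; the class and method names (StatsEncodingSketch, writeVLong, encodeStats) are hypothetical, and the variable-length layout (7 data bits per byte, high bit set when more bytes follow) is written out by hand. It covers only the case where both DocFreq and TotalTermFreq are stored.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: mimics the VInt/VLong byte layout, not Lucene's own classes. */
final class StatsEncodingSketch {

  /** Writes v in 7-bit groups, low-order group first; the high bit means "more bytes follow". */
  static void writeVLong(ByteArrayOutputStream out, long v) {
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
  }

  /** Encodes one stats entry: DocFreq, then TotalTermFreq stored as (TotalTermFreq - DocFreq). */
  static byte[] encodeStats(int docFreq, long totalTermFreq) {
    if (totalTermFreq < docFreq) {
      throw new IllegalArgumentException("totalTermFreq must be >= docFreq");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVLong(out, docFreq);                 // a non-negative VInt has the same byte layout
    writeVLong(out, totalTermFreq - docFreq); // delta is 0 when every doc has one occurrence
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // A term that occurs exactly once in each of 300 documents.
    byte[] bytes = encodeStats(300, 300);
    System.out.println("encoded length: " + bytes.length);
  }
}
```

Storing the difference rather than the raw total keeps the second value small, and it is zero for terms that occur once per document, so it usually fits in a single byte.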
Term Index

The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.
- TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- DirOffset --> {@link DataOutput#writeLong Uint64}
- IndexStartFP --> {@link DataOutput#writeVLong VLong}
- FSTIndex --> {@link FST FST<byte[]>}
Notes:
- The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
- DirOffset is a pointer to the start of the IndexStartFPs for all fields.
- An on-disk block may end up containing too many terms, that is, more than the allowed maximum (default: 48). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the output in the FST for the block's prefix encodes the leading byte of each sub-block and its file pointer.
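
As a rough illustration of the prefix-to-block mapping described in the notes above, the following toy model replaces the FST with a plain hash map from prefix strings to block file pointers. Everything in it is an assumption made for illustration (the class PrefixIndexSketch, its methods, the pointer values); it ignores floor blocks, byte-level terms, and the FST encoding entirely.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the per-field index: maps a term prefix to a block file pointer. */
final class PrefixIndexSketch {
  private final Map<String, Long> blockPointerByPrefix = new HashMap<>();

  /** Registers the block that holds all terms starting with the given prefix. */
  void addBlock(String prefix, long filePointer) {
    blockPointerByPrefix.put(prefix, filePointer);
  }

  /** Returns the pointer of the block with the longest indexed prefix of term, or -1 if none. */
  long findBlock(String term) {
    for (int len = term.length(); len >= 0; len--) {
      Long fp = blockPointerByPrefix.get(term.substring(0, len));
      if (fp != null) {
        return fp;
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    PrefixIndexSketch index = new PrefixIndexSketch();
    index.addBlock("", 0L);      // root block (empty prefix)
    index.addBlock("lu", 100L);  // hypothetical pointers into the .tim file
    index.addBlock("luc", 250L);
    System.out.println(index.findBlock("lucene"));
  }
}
```

A lookup walks from the longest candidate prefix to the shortest, so a term always lands in the most specific block that could contain it.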
@see BlockTreeTermsReader @lucene.experimental

      final int minTermsInBlock = _TestUtil.nextInt(random, 2, 100);
      final int maxTermsInBlock = Math.max(2, (minTermsInBlock-1)*2 + random.nextInt(100));


      boolean success = false;
      try {
        fields = new BlockTreeTermsWriter(state, postingsWriter, minTermsInBlock, maxTermsInBlock);
        success = true;
      } finally {
        if (!success) {
          postingsWriter.close();
        }
      }
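
The way maxTermsInBlock is derived in the snippet above keeps it at or above 2*(minTermsInBlock-1). A minimal sketch of that relationship follows, assuming the writer rejects block sizes that violate it; the class BlockSizeCheckSketch and its methods are hypothetical and only approximate whatever argument checking the real constructor performs.

```java
/** Hypothetical validation of min/max block sizes; not Lucene's actual constructor code. */
final class BlockSizeCheckSketch {

  /** Throws if the pair of block sizes is not usable under the assumed constraints. */
  static void checkBlockSizes(int minItemsInBlock, int maxItemsInBlock) {
    if (minItemsInBlock < 2) {
      throw new IllegalArgumentException("minItemsInBlock must be >= 2; got " + minItemsInBlock);
    }
    if (maxItemsInBlock < minItemsInBlock) {
      throw new IllegalArgumentException("maxItemsInBlock must be >= minItemsInBlock");
    }
    if (2 * (minItemsInBlock - 1) > maxItemsInBlock) {
      throw new IllegalArgumentException("maxItemsInBlock must be >= 2*(minItemsInBlock-1)");
    }
  }

  /** Convenience wrapper that reports validity instead of throwing. */
  static boolean isValid(int minItemsInBlock, int maxItemsInBlock) {
    try {
      checkBlockSizes(minItemsInBlock, maxItemsInBlock);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The documented defaults (25 and 48) satisfy the constraint: 2 * (25 - 1) == 48.
    System.out.println(isValid(25, 48));
    // Sizes derived as in the snippet above, with the smallest random increment (0).
    for (int min = 2; min <= 100; min++) {
      int max = Math.max(2, (min - 1) * 2);
      checkBlockSizes(min, max);
    }
  }
}
```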

    final PostingsWriterBase docs = new Siren10PostingsWriter(state,
      this.getFactory());


    boolean success = false;
    try {
      final FieldsConsumer ret = new BlockTreeTermsWriter(state, docs,
        BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE,
        BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE);
      success = true;
      return ret;
    }

  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);


    boolean success = false;
    try {
      FieldsConsumer ret = new BlockTreeTermsWriter(state, 
                                                    postingsWriter,
                                                    minTermBlockSize, 
                                                    maxTermBlockSize);
      success = true;
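      return ret;
    } finally {
      // Release the postings writer if the terms writer could not be created.
      if (!success) {
        IOUtils.closeWhileHandlingException(postingsWriter);
      }
    }
  }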
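
Every snippet on this page wraps the BlockTreeTermsWriter constructor in the same "success flag" idiom: the already-opened postings writer is closed only if building the terms writer fails. The sketch below reproduces the idiom with plain java.io.Closeable objects so it runs without Lucene; Resource, Wrapper, open and innerClosedAfter are hypothetical names introduced for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

/** Demonstrates the success-flag cleanup idiom with stand-in resources. */
final class SuccessFlagSketch {

  /** Stand-in for the postings writer: records whether it was closed. */
  static final class Resource implements Closeable {
    boolean closed;
    @Override public void close() { closed = true; }
  }

  /** Stand-in for the terms writer: owns the inner resource once constructed. */
  static final class Wrapper implements Closeable {
    private final Resource inner;
    Wrapper(Resource inner, boolean fail) throws IOException {
      if (fail) {
        throw new IOException("simulated constructor failure");
      }
      this.inner = inner;
    }
    @Override public void close() { inner.close(); }
  }

  /** Builds the wrapper; closes the inner resource only when construction fails. */
  static Wrapper open(Resource inner, boolean fail) throws IOException {
    boolean success = false;
    try {
      Wrapper w = new Wrapper(inner, fail);
      success = true;
      return w;
    } finally {
      if (!success) {
        inner.close();
      }
    }
  }

  /** Reports whether the inner resource ended up closed after an open attempt. */
  static boolean innerClosedAfter(boolean fail) {
    Resource r = new Resource();
    try {
      open(r, fail);
    } catch (IOException expected) {
      // The failure is simulated; only the cleanup behaviour matters here.
    }
    return r.closed;
  }

  public static void main(String[] args) {
    System.out.println("closed after failure: " + innerClosedAfter(true));
    System.out.println("closed after success: " + innerClosedAfter(false));
  }
}
```

On success the caller receives the wrapper and becomes responsible for closing it, which is why the inner resource must stay open in that path.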

    try {
      docsWriter = new Lucene41PostingsWriter(state);


      pulsingWriterInner = new PulsingPostingsWriter(2, docsWriter);
      pulsingWriter = new PulsingPostingsWriter(1, pulsingWriterInner);
      FieldsConsumer ret = new BlockTreeTermsWriter(state, pulsingWriter, 
          BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE, BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE);
      success = true;
      return ret;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(docsWriter, pulsingWriterInner, pulsingWriter);
      }
    }

      docsWriter = wrappedPostingsBaseFormat.postingsWriterBase(state);


      // Terms that have <= freqCutoff number of docs are
      // "pulsed" (inlined):
      pulsingWriter = new PulsingPostingsWriter(freqCutoff, docsWriter);
      FieldsConsumer ret = new BlockTreeTermsWriter(state, pulsingWriter, minBlockSize, maxBlockSize);
      success = true;
      return ret;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(docsWriter, pulsingWriter);
      }
    }


    // pluggable?  Ie so that this codec would record which
    // index impl was used, and switch on loading?
    // Or... you must make a new Codec for this?
    boolean success = false;
    try {
      FieldsConsumer ret = new BlockTreeTermsWriter(state, docs, minBlockSize, maxBlockSize);
      success = true;
      return ret;
    } finally {
      if (!success) {
        docs.close();
      }
    }

Related Classes of org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter

org.apache.lucene.codecs.BlockTermState

org.apache.lucene.codecs.lucene40.Lucene40RWPostingsFormat

org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat

org.apache.lucene.codecs.mockrandom.MockRandomPostingsFormat

org.apache.lucene.codecs.nestedpulsing.NestedPulsingPostingsFormat

org.apache.lucene.codecs.pulsing.PulsingPostingsFormat

org.apache.lucene.codecs.TermStats

org.apache.lucene.store.IndexOutput

org.apache.lucene.util.BytesRef

org.apache.lucene.util.FixedBitSet
