LZ4 compression algorithm (http://code.google.com/p/lz4/), which is fast to compress and very fast to decompress data. Although the compression method used focuses more on speed than on compression ratio, it should provide interesting compression ratios for redundant inputs (such as log files, HTML or plain text).
File formats
Stored fields are represented by two files:
- A fields data file (extension .fdt). This file stores a compact representation of documents in compressed blocks of 16KB or more. When writing a segment, documents are appended to an in-memory byte[] buffer. When its size reaches 16KB or more, some metadata about the documents is flushed to disk, immediately followed by a compressed representation of the buffer using the LZ4 compression format.
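The buffering logic can be pictured with a minimal sketch. This is not Lucene's actual writer: the class and helper names (ChunkedDocBuffer, writeChunkMetadata, writeCompressedDocs) are illustrative, and java.util.zip.Deflater merely stands in for the LZ4 codec so that the sketch stays JDK-only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;

/** Minimal sketch of the "buffer documents, flush a compressed chunk at 16KB" idea. */
class ChunkedDocBuffer {
  private static final int CHUNK_SIZE = 16 * 1024; // flush threshold

  private final ByteArrayOutputStream pendingBytes = new ByteArrayOutputStream();
  private int pendingDocs = 0;

  /** Appends one serialized document; flushes a chunk once 16KB or more are buffered. */
  void addDocument(byte[] serializedDoc) throws IOException {
    pendingBytes.write(serializedDoc);
    pendingDocs++;
    if (pendingBytes.size() >= CHUNK_SIZE) {
      flushChunk();
    }
  }

  /** Writes chunk metadata, then a compressed copy of the buffered documents. */
  private void flushChunk() throws IOException {
    byte[] docs = pendingBytes.toByteArray();

    // Lucene uses LZ4 here; Deflater is only a JDK stand-in for this sketch.
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(docs);
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] tmp = new byte[8192];
    while (!deflater.finished()) {
      int n = deflater.deflate(tmp);
      compressed.write(tmp, 0, n);
    }
    deflater.end();

    writeChunkMetadata(pendingDocs, docs.length);  // hypothetical helper: DocBase, ChunkDocs, ...
    writeCompressedDocs(compressed.toByteArray()); // hypothetical helper: the CompressedDocs bytes

    pendingBytes.reset();
    pendingDocs = 0;
  }

  private void writeChunkMetadata(int chunkDocs, int uncompressedLength) { /* omitted */ }
  private void writeCompressedDocs(byte[] bytes) { /* omitted */ }
}
```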
Here is a more detailed description of the field data file format:
- FieldData (.fdt) --> <Header>, PackedIntsVersion, <Chunk>^ChunkCount
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- PackedIntsVersion --> {@link PackedInts#VERSION_CURRENT} as a {@link DataOutput#writeVInt VInt}
- ChunkCount is not known in advance and is the number of chunks necessary to store all the documents of the segment
- Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs>
- DocBase --> the ID of the first document of the chunk as a {@link DataOutput#writeVInt VInt}
- ChunkDocs --> the number of documents in the chunk as a {@link DataOutput#writeVInt VInt}
- DocFieldCounts --> the number of stored fields of every document in the chunk, encoded as follows (a sketch of this encoding is given after the notes below):
- if chunkDocs=1, the unique value is encoded as a {@link DataOutput#writeVInt VInt}
- else read a {@link DataOutput#writeVInt VInt} (let's call it bitsRequired)
- if bitsRequired is 0 then all values are equal, and the common value is the following {@link DataOutput#writeVInt VInt}
- else bitsRequired is the number of bits required to store any value, and values are stored in a {@link PackedInts packed} array where every value is stored on exactly bitsRequired bits
- DocLengths --> the lengths of all documents in the chunk, encoded with the same method as DocFieldCounts
- CompressedDocs --> a compressed representation of <Docs> using the LZ4 compression format
- Docs --> <Doc>^ChunkDocs
- Doc --> <FieldNumAndType, Value>^DocFieldCount
- FieldNumAndType --> a {@link DataOutput#writeVLong VLong} whose 3 least significant bits are Type and whose remaining bits are FieldNum (see the sketch after this list)
- Type -->
- 0: Value is String
- 1: Value is BinaryValue
- 2: Value is Int
- 3: Value is Float
- 4: Value is Long
- 5: Value is Double
- 6, 7: unused
- FieldNum --> an ID of the field
- Value --> {@link DataOutput#writeString(String) String} | BinaryValue | Int | Float | Long | Double depending on Type
- BinaryValue --> ValueLength <Byte>^ValueLength
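As an illustration of the FieldNumAndType entry above, splitting the packed VLong back into a field number and a type code is plain bit arithmetic. The class and method names below are illustrative, not Lucene's:

```java
/** Sketch: pack/unpack FieldNumAndType, whose 3 least significant bits hold Type. */
final class FieldNumAndType {
  private static final int TYPE_BITS = 3;
  private static final long TYPE_MASK = (1L << TYPE_BITS) - 1; // 0b111

  /** Combines a field number and a type code (0-5) into one long, to be written as a VLong. */
  static long pack(int fieldNum, int type) {
    return ((long) fieldNum << TYPE_BITS) | type;
  }

  static int fieldNum(long packed) {
    return (int) (packed >>> TYPE_BITS);
  }

  static int type(long packed) {
    // 0: String, 1: BinaryValue, 2: Int, 3: Float, 4: Long, 5: Double
    return (int) (packed & TYPE_MASK);
  }
}
```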
Notes
- If documents are larger than 16KB, then chunks will likely contain only one document. However, a document never spans several chunks: all fields of a single document are stored in the same chunk.
- Given that the original lengths are written in the metadata of the chunk, the decompressor can leverage this information to stop decoding as soon as enough data has been decompressed.
- In case documents are incompressible, CompressedDocs will be less than 0.5% larger than Docs.
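The encoding shared by DocFieldCounts and DocLengths (referenced above) can be sketched as follows. The class and helper names are illustrative, writeVInt is a self-contained implementation of the usual variable-length int, and the packed-ints writer is left as a stub; Lucene's own PackedInts utility handles that part.

```java
import java.io.DataOutput;
import java.io.IOException;

/** Sketch of the per-chunk encoding shared by DocFieldCounts and DocLengths. */
final class ChunkIntsEncoder {

  /** Encodes one non-negative value per document of the chunk. */
  static void encode(DataOutput out, int[] values) throws IOException {
    if (values.length == 1) {
      // chunkDocs == 1: the unique value is written as a single VInt.
      writeVInt(out, values[0]);
      return;
    }
    boolean allEqual = true;
    int max = 0;
    for (int v : values) {
      allEqual &= (v == values[0]);
      max = Math.max(max, v);
    }
    if (allEqual) {
      // bitsRequired == 0 signals that all values are equal...
      writeVInt(out, 0);
      // ...and the common value follows as a VInt.
      writeVInt(out, values[0]);
    } else {
      // Otherwise write the number of bits needed for the largest value,
      // then the values in a packed array using exactly that many bits each.
      int bitsRequired = 32 - Integer.numberOfLeadingZeros(max);
      writeVInt(out, bitsRequired);
      writePackedValues(out, values, bitsRequired);
    }
  }

  /** Standard VInt: 7 data bits per byte, high bit set on all but the last byte. */
  static void writeVInt(DataOutput out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((byte) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.writeByte((byte) value);
  }

  /** Stub for a packed-ints writer; Lucene uses its PackedInts utility here. */
  static void writePackedValues(DataOutput out, int[] values, int bitsPerValue) throws IOException {
    // Left as a stub in this sketch.
  }
}
```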
- A fields index file (extension .fdx).
- FieldsIndex (.fdx) --> <Header>, <ChunkIndex>
- Header --> {@link CodecUtil#writeHeader CodecHeader}
- ChunkIndex: See {@link CompressingStoredFieldsIndexWriter}
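Conceptually, the index lets a reader map a document ID to the chunk that contains it and to that chunk's start offset in the .fdt file. The sketch below shows only that lookup idea over already-loaded arrays; it is an assumption for illustration and does not describe the actual on-disk layout, which is documented in {@link CompressingStoredFieldsIndexWriter}.

```java
import java.util.Arrays;

/** Conceptual sketch of a docID -> chunk lookup over loaded index data (not the on-disk format). */
final class ChunkLookup {
  private final int[] docBases;       // DocBase of each chunk, in ascending order
  private final long[] startPointers; // start offset of each chunk in the .fdt file

  ChunkLookup(int[] docBases, long[] startPointers) {
    this.docBases = docBases;
    this.startPointers = startPointers;
  }

  /** Returns the .fdt file pointer of the chunk containing docId. */
  long startPointerFor(int docId) {
    int idx = Arrays.binarySearch(docBases, docId);
    if (idx < 0) {
      idx = -idx - 2; // docId falls inside the chunk that starts at the previous doc base
    }
    return startPointers[idx];
  }
}
```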
Known limitations
This {@link StoredFieldsFormat} does not support individual documents larger than (2^31 - 2^14) bytes. If this is a problem, you should use another format, such as {@link Lucene40StoredFieldsFormat}.
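For reference, 2^31 - 2^14 = 2,147,467,264 bytes, i.e. slightly less than 2 GiB.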
@lucene.experimental