We attempt to perform the index updates in parallel using a backing thread pool. All threads are daemon threads, so they will not block the region from shutting down.
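A minimal sketch of the daemon-thread idea, using only the JDK. The class name `DaemonIndexUpdatePool`, the thread-name prefix, and the placeholder tasks are assumptions made for illustration; the real writer's pool configuration is not shown in this comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs index-update tasks in parallel on daemon threads. */
class DaemonIndexUpdatePool {

    /** Builds a factory whose threads are daemons, so they never keep the process alive. */
    static ThreadFactory daemonFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    /** Runs taskCount placeholder updates and returns how many executed on a daemon thread. */
    static int runAll(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize, daemonFactory("index-writer"));
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                // Placeholder for a real index update; it only reports the worker's daemon flag.
                tasks.add(() -> Thread.currentThread().isDaemon());
            }
            int daemonRuns = 0;
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (result.get()) {
                    daemonRuns++;
                }
            }
            return daemonRuns;
        } catch (Exception e) {
            throw new IllegalStateException("index update failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("updates run on daemon threads: " + runAll(4, 10));
    }
}
```

Because the workers are daemons, a pool that is never shut down explicitly still cannot prevent the host process from exiting.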
Implementations of this interface are used to write inverted lists in sequential order, as follows:
{@link #newDocumentRecord()} returns an {@link OutputBitStream} that must be used to write the document-record data. Note that there is no guarantee that the returned {@link OutputBitStream} coincides with the underlying bit stream. Moreover, there is no guarantee as to when the bits will be actually written on the underlying stream, except that when starting a new inverted list, the previous inverted list, if any, will be written onto the underlying stream. @author Paolo Boldi @author Sebastiano Vigna @since 1.2
IndexWriter creates and maintains an index. The create argument to the {@link #IndexWriter(Directory,Analyzer,boolean,MaxFieldLength) constructor} determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also {@link #IndexWriter(Directory,Analyzer,MaxFieldLength) constructors} with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.
In either case, documents are added with {@link #addDocument(Document) addDocument} and removed with {@link #deleteDocuments(Term)} or {@link #deleteDocuments(Query)}. A document can be updated with {@link #updateDocument(Term,Document) updateDocument} (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, {@link #close() close} should be called.
These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered when there are enough buffered deletes (see {@link #setMaxBufferedDeleteTerms}) or enough added documents since the last flush, whichever is sooner. For the added documents, flushing is triggered either by RAM usage of the documents (see {@link #setRAMBufferSizeMB}) or the number of added documents. The default is to flush when RAM usage hits 16 MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either {@link #commit()} or {@link #close} is called. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see below for changing the {@link MergeScheduler}).
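The flush triggers described above can be modeled roughly as follows. This is a toy sketch, not IndexWriter's implementation: the class `FlushTriggerSketch`, its fields, the `DISABLED` sentinel, and the choice to reset every counter on a flush are all assumptions made for illustration.

```java
/** Hypothetical model of the flush decision; not the actual IndexWriter code. */
class FlushTriggerSketch {
    /** Assumed sentinel meaning "this trigger is switched off". */
    static final int DISABLED = -1;

    private final double ramBufferSizeMB;      // flush once buffered documents reach this size
    private final int maxBufferedDocs;         // or once this many documents are buffered
    private final int maxBufferedDeleteTerms;  // or once this many delete terms are buffered

    private long bufferedBytes;
    private int bufferedDocs;
    private int bufferedDeleteTerms;
    private int flushCount;

    FlushTriggerSketch(double ramBufferSizeMB, int maxBufferedDocs, int maxBufferedDeleteTerms) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    /** Buffers one document of the given approximate size, then checks the triggers. */
    void addDocument(long approxBytes) {
        bufferedBytes += approxBytes;
        bufferedDocs++;
        flushIfNeeded();
    }

    /** Buffers one delete term, then checks the triggers. */
    void deleteTerm() {
        bufferedDeleteTerms++;
        flushIfNeeded();
    }

    int flushCount() {
        return flushCount;
    }

    private void flushIfNeeded() {
        boolean ramFull = ramBufferSizeMB != DISABLED
                && bufferedBytes >= ramBufferSizeMB * 1024 * 1024;
        boolean docsFull = maxBufferedDocs != DISABLED && bufferedDocs >= maxBufferedDocs;
        boolean deletesFull = maxBufferedDeleteTerms != DISABLED
                && bufferedDeleteTerms >= maxBufferedDeleteTerms;
        if (ramFull || docsFull || deletesFull) {
            // Whichever trigger fires first wins; this sketch then resets all counters.
            bufferedBytes = 0;
            bufferedDocs = 0;
            bufferedDeleteTerms = 0;
            flushCount++;
        }
    }

    public static void main(String[] args) {
        // Flush by RAM only, using the 16 MB default mentioned in the text.
        FlushTriggerSketch writer = new FlushTriggerSketch(16.0, DISABLED, DISABLED);
        for (int i = 0; i < 16; i++) {
            writer.addDocument(1024 * 1024);
        }
        System.out.println("flushes: " + writer.flushCount());
    }
}
```

In this model a flush only counts buffered state as written; visibility to readers remains a separate step, as the paragraph above notes for commit() and close().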
If an index will not have more documents added for a while and optimal search performance is desired, then either the full {@link #optimize() optimize} method or partial {@link #optimize(int)} method should be called before the index is closed.
Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index.
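The single-writer guarantee can be illustrated with an exclusive lock file. The sketch below is a JDK-only approximation under stated assumptions: the class `DirectoryWriteLockSketch`, the unchecked `LockFailedException` stand-in, and the lock file name are hypothetical, and the actual locking strategy depends on the lock factory in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical single-writer guard based on an exclusive lock file. */
class DirectoryWriteLockSketch implements AutoCloseable {

    /** Stand-in for LockObtainFailedException; unchecked to keep the sketch short. */
    static class LockFailedException extends RuntimeException {
        LockFailedException(String message) {
            super(message);
        }
    }

    private final Path lockFile;

    DirectoryWriteLockSketch(Path indexDir) {
        // The file name is an assumption made for this sketch.
        this.lockFile = indexDir.resolve("write.lock");
        try {
            // createFile is atomic and fails if the file already exists.
            Files.createFile(lockFile);
        } catch (FileAlreadyExistsException e) {
            throw new LockFailedException("lock already held: " + lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock so that another writer can be opened. */
    @Override
    public void close() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if a second writer is rejected while the first one is still open. */
    static boolean secondWriterIsRejected() {
        try {
            Path dir = Files.createTempDirectory("index-sketch");
            try (DirectoryWriteLockSketch first = new DirectoryWriteLockSketch(dir)) {
                try {
                    new DirectoryWriteLockSketch(dir);
                    return false;
                } catch (LockFailedException expected) {
                    return true;
                }
            } finally {
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second writer rejected: " + secondWriterIsRejected());
    }
}
```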
Expert: IndexWriter allows an optional {@link IndexDeletionPolicy} implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.
Expert: IndexWriter allows you to separately change the {@link MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a {@link MergePolicy.MergeSpecification} describing the merges. It also selects merges to do for optimize(). (The default is {@link LogByteSizeMergePolicy}.) Then, the {@link MergeScheduler} is invoked with the requested merges and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.
NOTE: if you hit an OutOfMemoryError then IndexWriter will quietly record this fact and block all future segment commits. This is a defensive measure in case any internal state (buffered documents and deletions) was corrupted. Any subsequent calls to {@link #commit()} will throw an IllegalStateException. The only course of action is to call {@link #close()}, which internally will call {@link #rollback()}, to undo any changes to the index since the last commit. You can also just call {@link #rollback()} directly.
NOTE: {@link IndexWriter} instances are completely thread-safe, meaning multiple threads can call any of their methods concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.
NOTE: If you call Thread.interrupt() on a thread that's within IndexWriter, IndexWriter will try to catch this (e.g., if it's in a wait() or Thread.sleep()), and will then throw the unchecked exception {@link ThreadInterruptedException} and clear the interrupt status on the thread.
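The interrupt behavior described in this note can be sketched with plain JDK calls. The class `InterruptHandlingSketch` and the exception `InterruptedUnchecked` are hypothetical stand-ins; this is an illustration of the pattern, not the library's code.

```java
/** Hypothetical illustration of converting an interrupt into an unchecked exception. */
class InterruptHandlingSketch {

    /** Stand-in for ThreadInterruptedException. */
    static class InterruptedUnchecked extends RuntimeException {
        InterruptedUnchecked(InterruptedException cause) {
            super(cause);
        }
    }

    /** Sleeps as an internal wait might, rethrowing an interrupt as an unchecked exception. */
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupt status when it throws, and this sketch
            // deliberately does not restore it, matching the behavior described above.
            throw new InterruptedUnchecked(e);
        }
    }

    /** Returns true if the interrupt surfaced as an exception and the status was left cleared. */
    static boolean interruptIsConvertedAndCleared() {
        Thread.currentThread().interrupt(); // simulate an interrupt arriving before the wait
        try {
            pause(10);
            return false;
        } catch (InterruptedUnchecked e) {
            return !Thread.currentThread().isInterrupted();
        }
    }

    public static void main(String[] args) {
        System.out.println("converted and cleared: " + interruptIsConvertedAndCleared());
    }
}
```

Callers that rely on the interrupt flag therefore need to catch the unchecked exception, since the flag itself no longer records the interrupt.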
IndexWriter creates and maintains an index. The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.
In either case, documents are added with addDocument and removed with deleteDocuments. A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.
These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered when there are enough buffered deletes (see {@link #setMaxBufferedDeleteTerms}) or enough added documents since the last flush, whichever is sooner. For the added documents, flushing is triggered either by RAM usage of the documents (see {@link #setRAMBufferSizeMB}) or the number of added documents. The default is to flush when RAM usage hits 16 MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. You can also force a flush by calling {@link #flush}. When a flush occurs, both pending deletes and added documents are flushed to the index. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see below for changing the {@link MergeScheduler}).
The optional autoCommit argument to the constructors controls visibility of the changes to {@link IndexReader} instances reading the same index. When this is false, changes are not visible until {@link #close()} is called. Note that changes will still be flushed to the {@link org.apache.lucene.store.Directory} as new files, but are not committed (no new segments_N file is written referencing the new files) until {@link #close} is called. If something goes terribly wrong (for example the JVM crashes) before {@link #close()}, then the index will reflect none of the changes made (it will remain in its starting state). You can also call {@link #abort()}, which closes the writer without committing any changes, and removes any index files that had been flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example after you've done all your deletes but before you've done your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").
When autoCommit is true then every flush is also a commit ({@link IndexReader} instances will see each flush as changes to the index). This is the default, to match the behavior before 2.2. When running in this mode, be careful not to refresh your readers while optimize or segment merges are taking place as this can tie up substantial disk space.
Regardless of autoCommit, an {@link IndexReader} or {@link org.apache.lucene.search.IndexSearcher} will only see the index as of the "point in time" that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.
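The visibility rules above can be pictured with a small in-memory model. Everything here is hypothetical (the classes `PointInTimeSketch`, `SketchWriter`, and `SketchReader`, and the list-based "index"); `commit()` stands in for whichever call makes changes durable, which in the autoCommit=false mode described above is close().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory model of commit visibility and point-in-time readers. */
class PointInTimeSketch {

    static class SketchWriter {
        private final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        /** Buffers a document; it is not visible to any reader yet. */
        void addDocument(String doc) {
            pending.add(doc);
        }

        /** Makes all pending documents visible to readers opened from now on. */
        void commit() {
            committed.addAll(pending);
            pending.clear();
        }

        /** Discards pending documents, leaving the committed state untouched. */
        void abort() {
            pending.clear();
        }

        /** Opens a reader over a copy of the committed state as of this moment. */
        SketchReader openReader() {
            return new SketchReader(committed);
        }
    }

    static class SketchReader {
        private final List<String> snapshot;

        SketchReader(List<String> committedDocs) {
            // Copying freezes the reader's view; later commits do not affect it.
            this.snapshot = Collections.unmodifiableList(new ArrayList<>(committedDocs));
        }

        int numDocs() {
            return snapshot.size();
        }
    }

    public static void main(String[] args) {
        SketchWriter writer = new SketchWriter();
        writer.addDocument("a");
        SketchReader early = writer.openReader();
        writer.commit();
        SketchReader reopened = writer.openReader();
        System.out.println("early reader docs: " + early.numDocs());
        System.out.println("re-opened reader docs: " + reopened.numDocs());
    }
}
```

The reader opened before the commit keeps its empty view; only a reader opened afterwards sees the added document, which mirrors the need to re-open readers.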
If an index will not have more documents added for a while and optimal search performance is desired, then the optimize method should be called before the index is closed.
Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index.
Expert: IndexWriter allows an optional {@link IndexDeletionPolicy} implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.
Expert: IndexWriter allows you to separately change the {@link MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a {@link MergePolicy.MergeSpecification} describing the merges. It also selects merges to do for optimize(). (The default is {@link LogByteSizeMergePolicy}.) Then, the {@link MergeScheduler} is invoked with the requested merges and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.