com.sleepycat.je.rep.vlsn.VLSNIndex
A VLSN (Virtual LSN) is used to identify every log entry shared between members of the replication group. Since a JE log is identified by LSNs, we must have a way to map VLSN->LSN in order to fetch a replicated log record from the local log, using the VLSN. The VLSNIndex implements those mappings. The VLSNIndex has these responsibilities:

 - Generating new VLSNs. Only masters need to generate VLSNs, but any node may have the potential to become a master. The VLSN sequence must ascend over time and across recoveries, so the VLSN id must be preserved much like the database, node and txn ids.

 - Maintaining the VLSN range. Although each node needs to receive and store each log entry from the replication stream, over time the part of the stream that is stored can be reduced, either by log cleaning, or by syncups which can truncate the replication stream. A node always holds a contiguous portion of the replication stream. The VLSN range identifies that portion by its start and end VLSNs, as well as key landmarks such as the lastSync-able log entry and the last commit log entry. VLSN range information is used by elections and syncup.

 - Gatekeeper for waiting for the most recently logged entries. Feeders block upon the VLSNIndex when they are trying to fetch the most recently logged entries. These recent log entries are held in a two level cache within the VLSNIndex. A call to VLSNIndex.waitForLsn() goes through this sequence:
     1) check the log item stored in the vlsn wait latch, if the call did wait
     2) check the log item cache
   If both fail, the FeederReader will fetch the required log entry from the log buffers or from disk.

 - Providing the LSN mapping for a log record identified by its VLSN. The Feeders and the syncup protocol both need to retrieve log records by VLSN. To do that, we need an LSN mapping. Mappings are added to the VLSNIndex when replicated log entries are written into the local log. Although all mappings are registered, the VLSNIndex does not keep every one, in order to save on disk and in-memory storage. Only a sparse set is kept. When searching for a log entry by VLSN, the caller uses the closest available mapping and then scans the log looking for that entry (a lookup sketch appears at the end of this overview).

The VLSNIndex relies on the assumption that VLSN tagged log entries are ordered and contiguous in the log. That is, the LSN for VLSN 1 < the LSN for VLSN 2 < the LSN for VLSN 3, and there is never a gap in the VLSNs. However, at node syncup, the replication stream may need to be truncated when rolling back a non-committed log entry. We can't literally truncate the log files because the JE logs contain intermingled transactional and non transactional information. Instead, the truncation is done both logically, by amending the VLSNIndex, and physically, by overmarking those entries in the JE logs. Because of that, a physical dump of the log may show some VLSN tagged entries as duplicate and/or out of order, because they're abandoned log entries that are no longer logically part of the replication stream. For example, the log can look like this:

   LSN 100, VLSN 1
   LSN 200, VLSN 2  <- overmarked
   LSN 300, VLSN 3  <- overmarked
   --- syncup, rollback to VLSN 1, restart at VLSN 2 ---
   LSN 400, VLSN 2
   LSN 500, VLSN 3

VLSN->LSN mappings are created under the log write latch, which ensures that all VLSN tagged log entries are ordered in the logical replication stream in the log. However, the mapping is added to the VLSNIndex outside the log write latch, so the VLSNIndex database may have a momentary gap. For example:

   t0 - thread 1 logs entry at VLSN=1, LSN=100, within the log write latch
   t1 - thread 2 logs entry at VLSN=2, LSN=150, within the log write latch
   t2 - thread 3 logs entry at VLSN=3, LSN=200, within the log write latch
   t3 - thread 1 calls VLSNIndex.put(VLSN=1/LSN=100)
   t4 - thread 3 calls VLSNIndex.put(VLSN=3/LSN=200)
   t5 - thread 2 calls VLSNIndex.put(VLSN=2/LSN=150)

At t4, the VLSNIndex contains 1/100 and 3/200, but not 2/150. However, we know that the VLSNIndex always represents a contiguous range of VLSNs, so the fact that 2/150 is not yet present is handled, and is just like the case where the VLSNIndex optimized away the mapping in order to keep the index sparse. We do guarantee that the start and end VLSNs in the range have mappings, in order to always be able to provide an LTE and GTE mapping for all valid VLSNs. Because of that, if a VLSN comes out of order, it does not update the range.
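The following is a minimal sketch, not the real VLSNIndex API, of the lookup pattern described above: find the closest recorded mapping at or below the target VLSN, then scan forward in the log until the exact VLSN is reached. The names SparseVlsnLookup, LogReader, readEntryAt() and nextLsn() are hypothetical, introduced only for illustration.

   import java.util.TreeMap;

   public class SparseVlsnLookup {

       /* Hypothetical view of a log entry header. */
       interface LogEntry {
           long vlsn();
           long nextLsn();   // LSN of the entry that follows this one
       }

       /* Hypothetical accessor that reads the log entry stored at an LSN. */
       interface LogReader {
           LogEntry readEntryAt(long lsn);
       }

       /* Sparse mappings: only some VLSNs have a recorded LSN. */
       private final TreeMap<Long, Long> lsnByVlsn = new TreeMap<>();

       void put(long vlsn, long lsn) {
           lsnByVlsn.put(vlsn, lsn);
       }

       /*
        * Returns the LSN of the entry tagged with targetVlsn, starting from
        * the nearest known mapping <= targetVlsn and scanning forward. The
        * range invariant guarantees the range start always has a mapping,
        * and contiguity guarantees the scan will reach the target.
        */
       long fetchLsn(long targetVlsn, LogReader log) {
           long lsn = lsnByVlsn.floorEntry(targetVlsn).getValue();
           while (true) {
               LogEntry entry = log.readEntryAt(lsn);
               if (entry.vlsn() == targetVlsn) {
                   return lsn;
               }
               lsn = entry.nextLsn();   // keep scanning forward in the log
           }
       }
   }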
Persistent storage:

The VLSN->LSN mappings in the range are grouped into instances of com.sleepycat.je.util.VLSNBucket. Each bucket knows the first and last VLSN within its mini-range. We observe these invariants:
 - buckets are ordered by VLSN in the database and the bucket cache,
 - only the last bucket is the target of updates at any time,
 - a single bucket corresponds to a single file, but a single file may have multiple buckets covering it.

While it would be nice to also guarantee that there are no gaps between buckets, i.e.:

   bucket(N-1).last == bucket(N).first - 1
   bucket(N).last   == bucket(N+1).first - 1

it is not possible to do so, because the put() call is not serialized; we don't want to add overhead to the log write latch. In order to permit out of order put() calls, and to require that only the last bucket is updated, we must permit gaps between buckets. (A put() sketch appears at the end of this section.)

Buckets are both cached in memory by the VLSNIndex and stored persistently in an internal, non-replicated database. The database holds key/value pairs where

   key  = bucket.first
   data = bucket

Since the first valid VLSN is 1, key = -1 is reserved for storage of the VLSNRange.

Buckets are filled up as new VLSNs arrive (either because they've been generated by write operations on the master, or because they're incoming operations on the replica). They're flushed to disk periodically rather than with every new VLSN, because the update rate would have too much of a performance impact. Since this level of caching is happening, we must be careful to write in-memory buckets to disk at well known points to support recoverability. The flushing must be instigated by a third party activity, such as checkpointing, rather than by the action of adding a new mapping. That's because mappings are registered by the logging system, and although we are not holding the log write latch at that point, it seems inadvisable to recursively generate another logging call on behalf of the flush. Currently the VLSNIndex is flushed to disk at every checkpoint. It can also optionally happen more often, and (TODO) we may want to do so, because we've seen cases where checkpoints take a very long time. Perhaps we should flush when we flip to a new log file?

Once written to disk, the buckets are generally not updated. Updates can happen when the range is truncated, such as for syncup rollback, but the system is quiescent at that time. Log cleaning can delete buckets, but will not modify them. The VLSNRange does naturally change often, and that data record does get updated.
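As a rough illustration of the put() behavior described above -- only the last bucket is appended to, gaps between buckets are tolerated, and an out-of-order VLSN does not advance the range -- here is a minimal sketch. It is not the real VLSNIndex/VLSNBucket code; the Bucket class, the rangeEnd field and the capacity check are simplified stand-ins.

   import java.util.ArrayList;
   import java.util.List;
   import java.util.TreeMap;

   class SketchTracker {

       /* A mini-range of VLSN->LSN mappings. */
       static final class Bucket {
           static final int MAX_MAPPINGS = 100;   // arbitrary for the sketch
           final long firstVlsn;                  // database key would be this value
           long lastVlsn;
           final TreeMap<Long, Long> lsnByVlsn = new TreeMap<>();

           Bucket(long firstVlsn) {
               this.firstVlsn = firstVlsn;
               this.lastVlsn = firstVlsn;
           }

           boolean isFull() {
               return lsnByVlsn.size() >= MAX_MAPPINGS;
           }
       }

       private final List<Bucket> buckets = new ArrayList<>();
       private volatile long rangeEnd = -1;   // last VLSN covered by the range

       synchronized void put(long vlsn, long lsn) {
           Bucket current =
               buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);

           if (current == null || current.isFull() || vlsn > current.lastVlsn + 1) {
               /*
                * Start a new bucket when there is no bucket yet, the last
                * bucket is closed, or the new VLSN does not directly extend
                * it. This is where gaps between buckets can appear; only the
                * newest bucket is ever the target of updates.
                */
               current = new Bucket(vlsn);
               buckets.add(current);
           }

           current.lsnByVlsn.put(vlsn, lsn);
           current.lastVlsn = Math.max(current.lastVlsn, vlsn);

           /*
            * Only an in-order VLSN advances the end of the range; a
            * late-arriving smaller VLSN (vlsn <= rangeEnd) leaves the
            * published range alone.
            */
           if (vlsn > rangeEnd) {
               rangeEnd = vlsn;
           }
       }
   }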
Recovery:

The VLSN database is restored at recovery time just as all other databases are. However, there may be a portion of the VLSN range that was not flushed to disk. At recovery, we piggyback onto the log scanning that recovery already does and re-track any mappings found within the recovery range. Those mappings are merged into those stored on disk, so that the VLSNIndex correctly reflects the entire replication stream at startup. For example, suppose a log has:

   LSN
   100  firstActiveLSN
   200  checkpoint start
   300  VLSN 78
   400  VLSNIndex flushed here
   500  checkpoint end
   600  VLSN 79

The VLSNIndex is initially populated with the version of the index found at LSN 400. That doesn't include VLSN 79. A tracking pass is done from checkpoint start -> end of log, which sweeps up VLSN 78 and VLSN 79 into a temporary tracker. That tracker is merged into the VLSNIndex, to update its mappings through VLSN 79. Note that the checkpointed VLSNIndex must encompass all VLSN mappings that precede the checkpoint start of that recovery period. This follows the general philosophy that a checkpoint flushes all metadata, and recovery reads from checkpoint start onwards to add on any needed extra data.

Retrieving mappings:

Callers who need to retrieve mappings obtain a VLSNScanner, which acts as a cursor over the VLSNIndex. A VLSNScanner finds and saves the applicable VLSNBucket, and queries the bucket directly as long as it can provide mappings. This reduces the level of contention between multiple readers (feeders) and writers (application threads, or the replay thread). A scanner-style reader sketch follows the Readers section below.

Synchronization hierarchy:

To write a new mapping, you must have the mutex on the VLSNIndex, and then the tracker, which lets you obtain the correct bucket, and then you must have a mutex on the bucket. To read a mapping, you must have the tracker mutex to obtain the right bucket. If you already have the right bucket in hand, you only need the bucket mutex. In truth, buckets which are not the "currentBucket" are not modified again, so a future optimization would allow for reading a mapping on a finished bucket without synchronization.

The VLSNRange is updated as an atomic assignment to a volatile field after taking the mutex on the current bucket. It is read without a mutex, by looking at it as a volatile field.

The hierarchy is

   VLSNIndex -> VLSNTracker -> VLSNBucket
   VLSNIndex -> VLSNTracker -> VLSNRange
   VLSNIndex -> VLSNIndex.mappingSynchronizer
   VLSNIndex.flushSynchronizer -> VLSNTracker

Removing mappings vs reading mappings - sync on the range.

We also need to consider the fact that callers of the VLSNIndex may be holding other mutexes or IN latches, and that the VLSNIndex methods may do database operations to read or write to the internal VLSN database. That can result in a nested database operation, and we need to be careful to avoid deadlocks. To be safe, we disable critical eviction [#18475] in VLSNBucket.writeDatabase().

Writers
-------

Allocating a new VLSN: bump()
 - sync on the log write latch
   Note that since there is no synchronization on the VLSNIndex itself, [allocating a new VLSN, logging its entry] and [flushing the vlsn index to disk] is not atomic. See awaitConsistency().

Adding a mapping: put()
 - sync on VLSNIndex
   - sync on VLSNTracker to access the right bucket, and possibly create a new bucket. Atomically modify the VLSNRange.

Flushing mappings to disk: writeToDatabase()
 - sync on VLSNIndex.flushSynchronizer -> VLSNTracker

Replica side syncup truncates the VLSNIndex from the end:
 - no synchronization needed; the system is quiescent, and we can assume that VLSNs are neither read nor written by other threads.

Log cleaning truncates the VLSNIndex from the beginning:

We assume that the log cleaner is prohibited from deleting files that are being used for current feeding. We can also assume that the end of the log is not being deleted, and that we do not conflict with put(). We do have to worry about conflicting with backwards scans when executing syncup as a feeder, and with flushing mappings to disk. Shall we disable log file deletion at this point? Steps to take (a sketch follows this list):

First change the VLSNRange:
 - sync on VLSNIndex
   - atomically modify the VLSNRange to ensure that no readers or writers touch the buckets that will be deleted.
   - sync on VLSNTracker to delete any dead buckets. Do that before updating the on-disk database, so that we don't lose any buckets to writeToDatabase().
 - without synchronization, scan the database and non-transactionally delete any on-disk buckets that are <= the log cleaned file.
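A rough sketch of the head-truncation ordering just described: narrow the published range first, drop dead in-memory buckets under the tracker mutex, then delete on-disk bucket records without synchronization. The VlsnRange class, the tracker stand-in, the LongBinding key encoding, and the "bucket key <= truncation point" test are simplified assumptions (the real check is against the cleaned file); the cursor calls follow the standard com.sleepycat.je API.

   import com.sleepycat.bind.tuple.LongBinding;
   import com.sleepycat.je.Cursor;
   import com.sleepycat.je.Database;
   import com.sleepycat.je.DatabaseEntry;
   import com.sleepycat.je.LockMode;
   import com.sleepycat.je.OperationStatus;

   class HeadTruncationSketch {

       /* Immutable range snapshot published through a volatile field. */
       static final class VlsnRange {
           final long first;
           final long last;
           VlsnRange(long first, long last) { this.first = first; this.last = last; }
       }

       private final Object tracker = new Object();   // stands in for VLSNTracker
       private volatile VlsnRange range = new VlsnRange(1, 100);

       /*
        * lastVlsnDeleted - highest VLSN contained in the cleaned files
        * bucketDb        - internal database holding key=bucket.first records
        */
       void truncateFromHead(long lastVlsnDeleted, Database bucketDb) {

           /*
            * Step 1: under the VLSNIndex mutex, atomically publish a
            * narrower range so no reader or writer touches the doomed
            * buckets, then drop the dead in-memory buckets under the
            * tracker mutex, before any writeToDatabase() could
            * re-persist them.
            */
           synchronized (this) {
               range = new VlsnRange(lastVlsnDeleted + 1, range.last);
               synchronized (tracker) {
                   // ... remove cached buckets wholly below the new range start.
               }
           }

           /*
            * Step 2: without synchronization, non-transactionally delete
            * on-disk bucket records that now fall below the range.
            */
           DatabaseEntry key = new DatabaseEntry();
           DatabaseEntry data = new DatabaseEntry();
           Cursor cursor = bucketDb.openCursor(null, null);
           try {
               while (cursor.getNext(key, data, LockMode.DEFAULT)
                          == OperationStatus.SUCCESS) {
                   long firstVlsnInBucket = LongBinding.entryToLong(key);
                   if (firstVlsnInBucket >= 0              // skip the range record at key -1
                       && firstVlsnInBucket <= lastVlsnDeleted) {
                       cursor.delete();
                   }
               }
           } finally {
               cursor.close();
           }
       }
   }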
Readers
-------

An active forward feeder checks if a mapping exists, and waits if necessary:
 - read the current VLSNRange without a mutex. If it is not satisfactory:
 - sync on VLSNIndex
   - sync on VLSNIndex.mappingSynchronizer

An active forward feeder reads a mapping:
 - first, getBucket(): sync on VLSNTracker to access the right bucket
 - if the bucket is in hand: sync on the target bucket to read the mapping
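To make the reader-side locking concrete, here is a minimal sketch of a scanner-like reader, using the simplified bucket structures from the earlier sketches rather than the real VLSNScanner API; the covers()/getLteLsn() helpers and the trackerMutex field are illustrative assumptions. It shows the fast path (bucket mutex only, when the saved bucket still covers the VLSN) and the slow path (tracker mutex, then bucket mutex).

   import java.util.TreeMap;

   class ScannerSketch {

       /* Simplified bucket: a mini-range of VLSN->LSN mappings. */
       static final class Bucket {
           final long firstVlsn;
           final long lastVlsn;
           private final TreeMap<Long, Long> lsnByVlsn;

           Bucket(long firstVlsn, long lastVlsn, TreeMap<Long, Long> mappings) {
               this.firstVlsn = firstVlsn;
               this.lastVlsn = lastVlsn;
               this.lsnByVlsn = mappings;
           }

           boolean covers(long vlsn) {
               return vlsn >= firstVlsn && vlsn <= lastVlsn;
           }

           /* Closest mapping at or below the requested VLSN, bucket mutex held. */
           synchronized long getLteLsn(long vlsn) {
               return lsnByVlsn.floorEntry(vlsn).getValue();
           }
       }

       /* Stand-in for the tracker's ordered bucket cache. */
       private final TreeMap<Long, Bucket> bucketsByFirstVlsn = new TreeMap<>();
       private final Object trackerMutex = new Object();

       /* The bucket saved by this scanner from the previous lookup. */
       private Bucket currentBucket;

       void addBucket(Bucket b) {
           synchronized (trackerMutex) {
               bucketsByFirstVlsn.put(b.firstVlsn, b);
           }
       }

       long getLteLsn(long vlsn) {
           // Fast path: the saved bucket still covers this VLSN, so only the
           // bucket's own mutex is needed.
           if (currentBucket != null && currentBucket.covers(vlsn)) {
               return currentBucket.getLteLsn(vlsn);
           }

           // Slow path: take the tracker mutex to find the right bucket, then
           // save it for subsequent lookups by this scanner.
           synchronized (trackerMutex) {
               currentBucket = bucketsByFirstVlsn.floorEntry(vlsn).getValue();
           }
           return currentBucket.getLteLsn(vlsn);
       }
   }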