com.sleepycat.je.recovery.Checkpointer
The Checkpointer looks through the tree for internal nodes that must be flushed to the log. Checkpoint flushes must be done in ascending order from the bottom of the tree up.

Checkpoint and IN Logging Rules
-------------------------------
The checkpoint must log, and make accessible via non-provisional ancestors, all INs that are dirty at CkptStart.

If we crash and recover from that CkptStart onward, any IN that became dirty (before the crash) after the CkptStart must become dirty again as the result of replaying the action that caused it to originally become dirty.

Therefore, when an IN is dirtied at some point in the checkpoint interval, but is not logged by the checkpoint, the log entry representing the action that dirtied the IN must follow either the CkptStart or the FirstActiveLSN that is recorded in the CkptEnd entry. The FirstActiveLSN is less than or equal to the CkptStart LSN. Recovery will process LNs between the FirstActiveLSN and the end of the log. Other entries are only processed from the CkptStart forward. And provisional entries are not processed.

Example: Non-transactional LN logging. We take two actions: 1) log the LN and then 2) dirty the parent BIN. What if the LN is logged before CkptStart and the BIN is dirtied after CkptStart? How do we avoid breaking the rules? The answer is that we log the LN while holding the latch on the parent BIN, and we don't release the latch until after we dirty the BIN. The construction of the checkpoint dirty map requires latching the BIN. Since the LN was logged before CkptStart, the BIN will be dirtied before the checkpointer latches it during dirty map construction. So the BIN will always be included in the dirty map and logged by the checkpoint.

Example: Abort. We take two actions: 1) log the abort and then 2) undo the changes, which modifies (dirties) the BIN parents of the undone LNs. There is nothing to prevent logging CkptStart in between these two actions, so how do we avoid breaking the rules?
The answer is that we do not unregister the transaction until after the undo phase. So although the BINs may be dirtied by the undo after CkptStart is logged, the FirstActiveLSN will be prior to CkptStart. Therefore, we will process the Abort and replay the action that modifies the BINs.

Exception: Lazy migration. The log cleaner will make an IN dirty without logging an action that makes it dirty. This is an exception to the general rule that actions should be logged when they cause dirtiness. The reasons this is safe are:

1. The IN contents are not modified, so there is no information lost if the IN is never logged, or is logged provisionally and no ancestor is logged non-provisionally.

2. If the IN is logged non-provisionally, this will have the side effect of recording the old LSN as being obsolete. However, the general rules for checkpointing and recovery will ensure that the new version is used in the Btree. The new version will either be replayed by recovery or referenced in the active Btree via a non-provisional ancestor.

Checkpoint Algorithm
--------------------
The final checkpointDirtyMap field is used to hold (in addition to the dirty INs) the state of the checkpoint and highest flush levels. Access to this object is synchronized so that eviction and checkpointing can access it concurrently. When a checkpoint is not active, the state is CkptState.NONE and the dirty map is empty. When a checkpoint runs, we do this:

1. Get set of files from cleaner that can be deleted after this checkpoint.
2. Set checkpointDirtyMap state to DIRTY_MAP_INCOMPLETE, meaning that dirty map construction is in progress.
3. Log CkptStart
4. Construct dirty map, organized by Btree level, from dirty INs in INList. The highest flush levels are calculated during dirty map construction. Set checkpointDirtyMap state to DIRTY_MAP_COMPLETE.
5. Flush INs in dirty map.
   + First, flush the bottom two levels a sub-tree at a time, where a sub-tree is one IN at level two and all its BIN children. Higher levels (above level two) are logged strictly by level, not using subtrees.
      o If je.checkpointer.highPriority=false, we log one IN at a time, whether or not the IN is logged as part of a subtree, and do a Btree search for the parent of each IN.
      o If je.checkpointer.highPriority=true, for the bottom two levels we log each sub-tree in a single call to the LogManager with the parent IN latched, and we only do one Btree search for each level two IN. Higher levels are logged one IN at a time as with highPriority=false.
   + The Provisional property is set as follows, depending on the level of the IN:
      o level is max flush level: Provisional.NO
      o level is bottom level: Provisional.YES
      o Otherwise (middle levels): Provisional.BEFORE_CKPT_END
6. Flush VLSNIndex cache to make VLSNIndex recoverable.
7. Flush UtilizationTracker (write FileSummaryLNs) to persist all tracked obsolete offsets and utilization summary info, to make this info recoverable.
8. Log CkptEnd
9. Delete cleaned files from step 1.
10. Set checkpointDirtyMap state to NONE.

Provisional.BEFORE_CKPT_END
---------------------------
See Provisional.java for a description of the relationship between the checkpoint algorithm above and the BEFORE_CKPT_END property.

Coordination of Eviction and Checkpointing
------------------------------------------
Eviction can proceed concurrently with all phases of a checkpoint, and eviction may take place concurrently in multiple threads. This concurrency is crucial to avoid blocking application threads that perform eviction and to reduce the amount of eviction required in application threads.

Eviction calls Checkpointer.coordinateEvictionWithCheckpoint, which calls DirtyINMap.coordinateEvictionWithCheckpoint, just before logging an IN.
coordinateEvictionWithCheckpoint returns whether the IN should be logged provisionally (Provisional.YES) or non-provisionally (Provisional.NO). Other coordination necessary depends on the state of the checkpoint:

+ NONE: No additional action.
   o return Provisional.NO
+ DIRTY_MAP_INCOMPLETE: The parent IN is added to the dirty map, exactly as if it were encountered as dirty in the INList during dirty map construction.
   o IN level >= highest flush level: return Provisional.NO
   o IN level < highest flush level: return Provisional.YES
+ DIRTY_MAP_COMPLETE:
   o IN is root: return Provisional.NO
   o IN is not root: return Provisional.YES

In general this is designed so that eviction will use the same provisional value that would be used by the checkpoint, as if the checkpoint itself were logging the IN. However, there are several conditions where this is not exactly the case.

1. Eviction may log an IN with Provisional.YES when the IN was not dirty at the time of dirty map creation, if it became dirty afterwards. In this case, the checkpointer would not have logged the IN at all. This is safe because the actions that made that IN dirty are logged in the recovery period.

2. Eviction may log an IN with Provisional.YES after the checkpoint has logged it, if it becomes dirty again. In this case the IN is logged twice, which would not have been done by the checkpoint alone. This is safe because the actions that made that IN dirty are logged in the recovery period.

3. An intermediate level IN (not bottom most and not the highest flush level) will be logged by the checkpoint with Provisional.BEFORE_CKPT_END but will be logged by eviction with Provisional.YES. See below for why this is safe.

4. Between checkpoint step 8 (log CkptEnd) and 10 (set checkpointDirtyMap state to NONE), eviction may log an IN with Provisional.YES, although a checkpoint is not strictly active during this interval. See below for why this is safe.
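
The two decision tables (the checkpoint's per-level rule and the eviction state table) can be compared side by side in a small sketch. This is not JE's code: the class ProvisionalSketch, the methods forCheckpoint and forEviction, and the use of plain ints for Btree levels (1 = bottom/BIN level) are assumptions made for illustration. The sketch also assumes that the max-flush-level rule takes precedence when an IN is both the bottom level and the max flush level, and it omits the side effect of adding the parent IN to the dirty map in the DIRTY_MAP_INCOMPLETE state.

```java
/** Hypothetical sketch of the provisional decisions described above; not JE's API. */
final class ProvisionalSketch {

    enum Provisional { NO, YES, BEFORE_CKPT_END }

    enum CkptState { NONE, DIRTY_MAP_INCOMPLETE, DIRTY_MAP_COMPLETE }

    /** Value the checkpoint uses when it flushes an IN at the given level. */
    static Provisional forCheckpoint(int level, int bottomLevel, int maxFlushLevel) {
        if (level == maxFlushLevel) {
            return Provisional.NO;              // assumed to win over the bottom-level rule
        }
        if (level == bottomLevel) {
            return Provisional.YES;
        }
        return Provisional.BEFORE_CKPT_END;     // middle levels
    }

    /** Value eviction uses, following the checkpoint state table. */
    static Provisional forEviction(CkptState state,
                                   int level,
                                   int highestFlushLevel,
                                   boolean isRoot) {
        switch (state) {
            case NONE:
                return Provisional.NO;
            case DIRTY_MAP_INCOMPLETE:
                // The real method also adds the parent IN to the dirty map here.
                return level >= highestFlushLevel ? Provisional.NO : Provisional.YES;
            case DIRTY_MAP_COMPLETE:
                return isRoot ? Provisional.NO : Provisional.YES;
            default:
                throw new AssertionError(state);
        }
    }

    public static void main(String[] args) {
        // Condition 3 from the list above: a middle-level IN (level 2 of 3).
        System.out.println(forCheckpoint(2, 1, 3));
        System.out.println(forEviction(CkptState.DIRTY_MAP_COMPLETE, 2, 3, false));
    }
}
```

For a middle-level IN the two methods disagree (BEFORE_CKPT_END versus YES), which is exactly condition 3 in the list above.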
It is safe for eviction to log an IN as Provisional.YES for the last two special cases, because this does not cause incorrect recovery behavior. For recovery to work properly, it is only necessary that:

+ Provisional.NO is used for INs at the max flush level during an active checkpoint.
+ Provisional.YES or BEFORE_CKPT_END is used for INs below the max flush level, to avoid replaying an IN during recovery that may depend on a file deleted as the result of the checkpoint.

You may ask why we don't use Provisional.YES for eviction when a checkpoint is not active. There are two reasons, both related to performance:

1. This would be wasteful when an IN is evicted in between checkpoints, and that portion of the log is processed by recovery later, in the event of a crash. The evicted INs would be ignored by recovery, but the actions that caused them to be dirty would be replayed and the INs would be logged again redundantly.

2. Logging an IN provisionally will not count the old LSN as obsolete immediately, so cleaner utilization will be inaccurate until a non-provisional parent is logged, typically by the next checkpoint. It is always important to keep the cleaner from stalling and spiking, to keep latency and throughput as level as possible.

Therefore, it is safe to log with Provisional.YES in between checkpoints, but not desirable.

Although we don't do this, it would be safe and optimal to evict with BEFORE_CKPT_END in between checkpoints, because it would be treated by recovery as if it were Provisional.NO. This is because the interval between checkpoints is only processed by recovery if it follows the last CkptEnd, and BEFORE_CKPT_END is treated as Provisional.NO if the IN follows the last CkptEnd.

However, it would not be safe to evict an IN with BEFORE_CKPT_END during a checkpoint, when logging of the IN's ancestors does not occur according to the rules of the checkpoint.
If this were done, then if the checkpoint completes and is used during a subsequent recovery, an obsolete offset for the old version of the IN will mistakenly be recorded. Below are two cases where BEFORE_CKPT_END is used correctly and one showing how it could be used incorrectly.

1. Correct use of BEFORE_CKPT_END when the checkpoint does not complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    Crash and recover

   Recovery will process BIN-A at 200 (it will be considered non-provisional) because there is no following CkptEnd. It is therefore correct that BIN-A at 050 is obsolete.

2. Correct use of BEFORE_CKPT_END when the checkpoint does complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    400 IN-B parent of BIN-A, non-provisional
    500 CkptEnd
    Crash and recover

   Recovery will not process BIN-A at 200 (it will be considered provisional) because there is a following CkptEnd, but it will process its parent IN-B at 400, and therefore the BIN-A at 200 will be active in the tree. It is therefore correct that BIN-A at 050 is obsolete.

3. Incorrect use of BEFORE_CKPT_END when the checkpoint does complete.

    050 BIN-A
    060 IN-B parent of BIN-A
    100 CkptStart
    200 BIN-A logged with BEFORE_CKPT_END
    300 FileSummaryLN with obsolete offset for BIN-A at 050
    400 CkptEnd
    Crash and recover

   Recovery will not process BIN-A at 200 (it will be considered provisional) because there is a following CkptEnd, but no parent IN-B is logged, and therefore the IN-B at 060 and BIN-A at 050 will be active in the tree. It is therefore incorrect that BIN-A at 050 is obsolete.

This last case is what caused the LFNF (log file not found) error in SR [#19422], when BEFORE_CKPT_END was mistakenly used for logging evicted BINs via CacheMode.EVICT_BIN.
During the checkpoint, we evict BIN-A and log it with BEFORE_CKPT_END, yet neither it nor its parent is part of the checkpoint. After the old version of the BIN has been counted obsolete, we crash and recover. Then the file containing the BIN (BIN-A at 050 above) is cleaned and deleted. During cleaning, the BIN is not migrated because an obsolete offset was previously recorded. The LFNF occurs when trying to access this BIN during a user operation.

CacheMode.EVICT_BIN
-------------------
Unlike in JE 4.0 where EVICT_BIN was first introduced, in JE 4.1 and later we do not use special rules when an IN is evicted. Since concurrent eviction and checkpointing are supported in JE 4.1, the above rules apply to EVICT_BIN as well as all other types of eviction.
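
The three BEFORE_CKPT_END log scenarios in this comment can be replayed with a toy model to see which version of BIN-A ends up active after recovery. The sketch below is only an illustration under simplifying assumptions: the class BeforeCkptEndSketch, the Entry type and the method activeBinLsn are hypothetical, the log holds a single BIN and a single parent IN, the FileSummaryLN entry is left out of the log, and recovery is reduced to "skip provisional INs, apply the rest in LSN order". Real JE recovery is considerably more involved.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of how recovery treats BEFORE_CKPT_END; hypothetical, not JE code. */
final class BeforeCkptEndSketch {

    enum Kind { BIN, PARENT_IN, CKPT_START, CKPT_END }

    enum Provisional { NO, YES, BEFORE_CKPT_END }

    static final class Entry {
        final long lsn;
        final Kind kind;
        final Provisional prov;
        final long childLsn;    // BIN version referenced by a PARENT_IN, else -1

        Entry(long lsn, Kind kind, Provisional prov, long childLsn) {
            this.lsn = lsn;
            this.kind = kind;
            this.prov = prov;
            this.childLsn = childLsn;
        }
    }

    static Entry bin(long lsn, Provisional p) { return new Entry(lsn, Kind.BIN, p, -1); }
    static Entry parent(long lsn, long childLsn) {
        return new Entry(lsn, Kind.PARENT_IN, Provisional.NO, childLsn);
    }
    static Entry marker(long lsn, Kind k) { return new Entry(lsn, k, Provisional.NO, -1); }

    /** Returns the LSN of the BIN version that is active after recovery. */
    static long activeBinLsn(List<Entry> log) {
        long lastCkptEnd = -1;
        for (Entry e : log) {
            if (e.kind == Kind.CKPT_END) lastCkptEnd = e.lsn;
        }
        long active = -1;
        for (Entry e : log) {   // log is assumed to be in ascending LSN order
            if (e.kind != Kind.BIN && e.kind != Kind.PARENT_IN) continue;
            // BEFORE_CKPT_END is provisional only if a CkptEnd follows the entry.
            boolean provisional = e.prov == Provisional.YES
                || (e.prov == Provisional.BEFORE_CKPT_END && e.lsn < lastCkptEnd);
            if (provisional) continue;
            active = (e.kind == Kind.BIN) ? e.lsn : e.childLsn;
        }
        return active;
    }

    /** Case 1: the checkpoint does not complete. */
    static List<Entry> case1() {
        return Arrays.asList(bin(50, Provisional.NO), parent(60, 50),
            marker(100, Kind.CKPT_START), bin(200, Provisional.BEFORE_CKPT_END));
    }

    /** Case 2: the checkpoint completes and logs the parent. */
    static List<Entry> case2() {
        return Arrays.asList(bin(50, Provisional.NO), parent(60, 50),
            marker(100, Kind.CKPT_START), bin(200, Provisional.BEFORE_CKPT_END),
            parent(400, 200), marker(500, Kind.CKPT_END));
    }

    /** Case 3: the checkpoint completes without logging the parent. */
    static List<Entry> case3() {
        return Arrays.asList(bin(50, Provisional.NO), parent(60, 50),
            marker(100, Kind.CKPT_START), bin(200, Provisional.BEFORE_CKPT_END),
            marker(400, Kind.CKPT_END));
    }

    public static void main(String[] args) {
        System.out.println(activeBinLsn(case1()));  // 200: counting 050 obsolete is correct
        System.out.println(activeBinLsn(case2()));  // 200: counting 050 obsolete is correct
        System.out.println(activeBinLsn(case3()));  // 50: 050 is still active, so it must not be obsolete
    }
}
```

In the third case the model leaves BIN-A at 050 active even though an obsolete offset was recorded for it, which is the inconsistency behind the failure described above.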