A single HLog is used by several HRegions simultaneously.
Each HRegion is identified by a unique long int. HRegions do not need to declare themselves before using the HLog; they simply include their HRegion-id in the {@link #append(Text,Text,Text,TreeMap,long)} or {@link #completeCacheFlush(Text,Text,long)} calls.
An HLog consists of multiple on-disk files, which have a chronological order. As data is flushed to other (better) on-disk structures, the log becomes obsolete. We can destroy all the log messages for a given HRegion-id up to the most-recent CACHEFLUSH message from that HRegion.
It's only practical to delete entire files. Thus, we delete an entire on-disk file F when all of the messages in F have a log-sequence-id that's older (smaller) than the most-recent CACHEFLUSH message for every HRegion that has a message in F.
TODO: Vuk Ercegovac also pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed. HBase writes edits to logs and to a memcache. The 'atomic' write to the log is meant to serve as insurance against abnormal RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if RegionServer exits without running its 'close'.
A single HLog is used by several HRegions simultaneously.
Each HRegion is identified by a unique long int
. HRegions do not need to declare themselves before using the HLog; they simply include their HRegion-id in the append
or completeCacheFlush
calls.
An HLog consists of multiple on-disk files, which have a chronological order. As data is flushed to other (better) on-disk structures, the log becomes obsolete. We can destroy all the log messages for a given HRegion-id up to the most-recent CACHEFLUSH message from that HRegion.
It's only practical to delete entire files. Thus, we delete an entire on-disk file F when all of the messages in F have a log-sequence-id that's older (smaller) than the most-recent CACHEFLUSH message for every HRegion that has a message in F.
Synchronized methods can never execute in parallel. However, between the start of a cache flush and the completion point, appends are allowed but log rolling is not. To prevent log rolling taking place during this period, a separate reentrant lock is used.
There is one HLog per RegionServer. All edits for all Regions carried by a particular RegionServer are entered first in the HLog.
Each HRegion is identified by a unique long int
. HRegions do not need to declare themselves before using the HLog; they simply include their HRegion-id in the append
or completeCacheFlush
calls.
An HLog consists of multiple on-disk files, which have a chronological order. As data is flushed to other (better) on-disk structures, the log becomes obsolete. We can destroy all the log messages for a given HRegion-id up to the most-recent CACHEFLUSH message from that HRegion.
It's only practical to delete entire files. Thus, we delete an entire on-disk file F when all of the messages in F have a log-sequence-id that's older (smaller) than the most-recent CACHEFLUSH message for every HRegion that has a message in F.
Synchronized methods can never execute in parallel. However, between the start of a cache flush and the completion point, appends are allowed but log rolling is not. To prevent log rolling taking place during this period, a separate reentrant lock is used.
To read an HLog, call {@link #getReader(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.fs.Path,org.apache.hadoop.conf.Configuration)}.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|