The FailureDetector API is used to determine the availability of the nodes in a cluster. Machines and servers can go down at any time, and request routing can use this API to try to avoid unavailable servers.
A FailureDetector is specific to a given cluster, so there should be only one instance per cluster per JVM.
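
One way to honor the one-instance-per-cluster-per-JVM rule is to hand out detectors from a shared cache keyed by cluster. The sketch below is only an illustration: the DetectorRegistry class, its forCluster method, and the use of a cluster name as the key are all assumptions made for this example and are not part of the FailureDetector API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical helper (not part of the documented API): keeps at most one
// detector instance per cluster name within this JVM.
final class DetectorRegistry<D> {

    private final ConcurrentMap<String, D> byCluster = new ConcurrentHashMap<>();
    private final Function<String, D> factory;

    DetectorRegistry(Function<String, D> factory) {
        this.factory = factory;
    }

    // Returns the detector already created for the cluster, or creates it
    // once if this is the first request for that cluster name.
    D forCluster(String clusterName) {
        return byCluster.computeIfAbsent(clusterName, factory);
    }
}
```

Callers that share one registry get the same detector back for the same cluster name, so every record of success or failure for that cluster lands in one place.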
Implementations can differ dramatically in how they approach the problem of determining node availability. Some implementations may rely heavily on invocations of recordException and recordSuccess to determine availability. Such a FailureDetector implementation performs little logic other than bookkeeping and implicitly trusts users of the API. Other implementations may be more selective in how they use the results of external callers' invocations of recordException and recordSuccess. They may treat these error/success calls as "hints" or may ignore them outright.
To contrast the two approaches:
- Externally-based implementations use algorithms that rely heavily on users for correctness. For example, suppose a user attempts to contact a node and the attempt fails. A responsible caller should invoke the recordException API to inform the FailureDetector that an error has taken place for the node. The FailureDetector has not determined availability on its own; it relies on the caller's report. So if the caller is incorrect or buggy, the FailureDetector's accuracy is compromised.
- Internally-based implementations rely on their own determination of node availability. For example, a heartbeat-style implementation may pay only a modicum of attention when its recordException and/or recordSuccess methods are invoked by outside callers.
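
The externally-based approach can be sketched as pure bookkeeping. This is a minimal illustration under stated assumptions, not the real interface: node ids are plain strings, the recordException and recordSuccess signatures are simplified, and the isAvailable method and the consecutive-failure threshold are hypothetical choices made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative externally-based detector. Signatures are simplified
// assumptions; it does no checking of its own and trusts callers entirely.
final class BookkeepingDetector {

    private final int maxConsecutiveFailures;
    private final ConcurrentMap<String, Integer> failures = new ConcurrentHashMap<>();

    BookkeepingDetector(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Caller reports a failed request: increment the node's failure count.
    void recordException(String nodeId) {
        failures.merge(nodeId, 1, Integer::sum);
    }

    // Caller reports a successful request: clear the node's failure count.
    void recordSuccess(String nodeId) {
        failures.remove(nodeId);
    }

    // Hypothetical query: a node is available until it reaches the threshold.
    boolean isAvailable(String nodeId) {
        return failures.getOrDefault(nodeId, 0) < maxConsecutiveFailures;
    }
}
```

Because the detector only counts what it is told, a caller that forgets to report errors, or reports them for the wrong node, directly skews the result.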
Naturally, there is a spectrum of implementations, and external calls to recordException and recordSuccess should (not must) provide some input to the internal algorithm.
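
A point in the middle of that spectrum might treat external calls as hints that trigger the detector's own check. The following sketch assumes a hypothetical HintedDetector class with simplified string-based signatures; the probe predicate stands in for whatever internal mechanism (for example, a heartbeat) an implementation actually uses, and isAvailable is likewise an assumed name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Illustrative hybrid detector. External reports never change state directly;
// they only prompt the detector to run its own probe for that node.
final class HintedDetector {

    // Assumption: the detector's own availability check, e.g. a heartbeat.
    private final Predicate<String> probe;
    private final Set<String> down = ConcurrentHashMap.newKeySet();

    HintedDetector(Predicate<String> probe) {
        this.probe = probe;
    }

    // An external error report is a hint to re-check, not proof of failure.
    void recordException(String nodeId) {
        refresh(nodeId);
    }

    // An external success report is likewise only a hint to re-check.
    void recordSuccess(String nodeId) {
        refresh(nodeId);
    }

    // Hypothetical query: reflects the most recent result of the probe.
    boolean isAvailable(String nodeId) {
        return !down.contains(nodeId);
    }

    private void refresh(String nodeId) {
        if (probe.test(nodeId)) {
            down.remove(nodeId);
        } else {
            down.add(nodeId);
        }
    }
}
```

With this design a buggy caller can cause extra probes but cannot mark a healthy node as unavailable, at the cost of making each reported error more expensive to handle.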
@see voldemort.store.routed.RoutedStore