{@code PipelineCallable} is intended to be used to inject auxiliary logic into the control flow of a Crunch pipeline. This can be used for a number of purposes, such as importing data into or exporting data from a cluster using Apache Sqoop, executing a legacy MapReduce job or Pig/Hive script within a Crunch pipeline, or sending emails or status notifications about a long-running pipeline during its execution.
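The Crunch planner needs to know three things about a {@code PipelineCallable} instance in order to manage it: the Target and PCollection instances that must be materialized before the instance may run, the outputs (if any) that the instance creates, and the logic to execute once the dependencies exist, which is defined in the {@code call} method.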
If a given {@code PipelineCallable} does not have any dependencies, it will be executed before any jobs are run by the planner. After that, the planner will keep track of when the dependencies of a given instance have been materialized, and then execute the instance as soon as they all exist. The Crunch planner uses a thread pool executor to run multiple {@code PipelineCallable} instances simultaneously, but you can indicate that an instance should be run by itself by overriding the {@code boolean runSingleThreaded()} method below to return true.
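The ordering rule described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: {@code CallableSchedulingSketch} and all of its methods are hypothetical names invented for illustration, callables and outputs are represented by plain strings, and execution is sequential, so the thread pool and {@code runSingleThreaded()} are not modeled. It is not how the Crunch planner is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the planner's bookkeeping; not part of Crunch. */
final class CallableSchedulingSketch {

  // Maps each registered callable's name to the outputs it still waits on.
  private final Map<String, Set<String>> pending = new LinkedHashMap<>();
  // Names of outputs that jobs have produced so far.
  private final Set<String> materialized = new HashSet<>();
  // Order in which callables were released for execution.
  private final List<String> executionOrder = new ArrayList<>();

  /** Registers a callable together with the outputs it depends on (possibly none). */
  void register(String callable, String... dependencies) {
    pending.put(callable, new HashSet<>(Arrays.asList(dependencies)));
  }

  /** Called once before any job runs: releases callables with no dependencies. */
  void start() {
    runReady();
  }

  /** Records that an output now exists, then releases newly unblocked callables. */
  void markMaterialized(String output) {
    materialized.add(output);
    runReady();
  }

  private void runReady() {
    // Collect first, then remove, to avoid modifying the map while iterating.
    List<String> ready = new ArrayList<>();
    for (Map.Entry<String, Set<String>> entry : pending.entrySet()) {
      if (materialized.containsAll(entry.getValue())) {
        ready.add(entry.getKey());
      }
    }
    for (String name : ready) {
      pending.remove(name);
      executionOrder.add(name);
    }
  }

  List<String> executionOrder() {
    return executionOrder;
  }
}
```

In this model, a callable registered with no dependencies is released by {@code start()}, while one that depends on two outputs is released only after both have been passed to {@code markMaterialized}.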
The {@code call} method returns a {@code Status} to indicate whether it succeeded or failed. A failed instance, or any exception or error thrown by the {@code call} method, will cause the overall Crunch pipeline containing this instance to fail.
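The failure rule can be sketched in isolation as follows. The {@code Status} enum here is a local stand-in whose constant names are assumptions, and {@code CallableStatusSketch.runAndCheck} is a hypothetical helper, not a Crunch method. The sketch only shows that a failure status and a thrown exception or error lead to the same outcome.

```java
import java.util.concurrent.Callable;

/** Hypothetical illustration of how a caller might interpret the result of call(). */
final class CallableStatusSketch {

  /** Local stand-in for the status reported by call(); constant names are assumptions. */
  enum Status { SUCCESS, FAILURE }

  /** Returns true if the pipeline may continue, false if it should be failed. */
  static boolean runAndCheck(Callable<Status> callable) {
    try {
      // Anything other than SUCCESS (including null) counts as a failure.
      return callable.call() == Status.SUCCESS;
    } catch (Throwable t) {
      // An exception or error thrown by call() is treated like a FAILURE status.
      return false;
    }
  }
}
```

An implementation that wants to abort the pipeline can therefore either return the failure status or let an exception propagate out of {@code call}.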
A number of helper methods are available as protected methods in this class so that implementations of {@code PipelineCallable} can use them within the {@code call} method. They provide access to the dependent Target/PCollection instances that this instance needs to exist, as well as to the {@code Configuration} instance for the overall Pipeline execution.
@param <Output> the output value returned by this instance (Void, PCollection, Pair<PCollection, PCollection>, etc.)