Examples of org.apache.crunch.PipelineCallable

A specialization of {@code Callable} that executes some sequential logic on the client machine as part of an overall Crunch pipeline in order to generate zero or more outputs, some of which may be {@code PCollection} instances that are processed by other jobs in the pipeline.

{@code PipelineCallable} is intended to be used to inject auxiliary logic into the control flow of a Crunch pipeline. This can be used for a number of purposes, such as moving data into or out of a cluster using Apache Sqoop, executing a legacy MapReduce job or Pig/Hive script within a Crunch pipeline, or sending emails or status notifications about the progress of a long-running pipeline during its execution.

The Crunch planner needs to know three things about a {@code PipelineCallable} instance in order to manage it:

1. The {@code Target} and {@code PCollection} instances that must have been materialized before this instance is allowed to run. This information should be specified via the {@code dependsOn} methods of the class.
2. The outputs that will be created after this instance is executed, if any. These outputs may be new {@code PCollection} instances that are used as inputs in other Crunch jobs. They should be specified by the {@code getOutput(Pipeline)} method of the class, which will be executed immediately after this instance is registered with the {@link Pipeline#sequentialDo} method.
3. The actual logic to execute, once the dependent Targets and PCollections have been created, in order to materialize the output data. This is defined in the {@code call} method of the class.
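
The three pieces above can be sketched with a small stand-in class. This is a minimal illustration that uses only the JDK, not the Crunch API: {@code MiniCallable}, {@code ExportStep}, the string-valued dependencies, and the paths are all hypothetical names chosen for this example, and the real {@code dependsOn} and {@code getOutput(Pipeline)} methods take Crunch types instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Simplified stand-in for illustration only; none of these types are Crunch classes.
final class ThreePartsSketch {
  enum Status { SUCCESS, FAILURE }

  // Hypothetical miniature of the three-part contract described above.
  static abstract class MiniCallable<O> implements Callable<Status> {
    // Dependencies are modeled as label -> path strings (an assumption of this sketch).
    private final Map<String, String> dependencies = new LinkedHashMap<>();

    // (1) Declare what must be materialized before this instance may run.
    MiniCallable<O> dependsOn(String label, String path) {
      dependencies.put(label, path);
      return this;
    }

    // Helper so subclasses can look up a declared dependency inside call().
    protected String dependency(String label) {
      return dependencies.get(label);
    }

    // (2) Describe the outputs up front, before call() has run.
    abstract O getOutput();

    // (3) The logic itself lives in call(); it is redeclared here without
    // checked exceptions only to keep the sketch short.
    @Override
    public abstract Status call();
  }

  // Hypothetical step that "exports" its input and reports where the result lives.
  static final class ExportStep extends MiniCallable<String> {
    @Override
    String getOutput() {
      return "/tmp/exported";
    }

    @Override
    public Status call() {
      // Succeeds only if the dependency it needs was declared.
      return dependency("input") != null ? Status.SUCCESS : Status.FAILURE;
    }
  }

  public static void main(String[] args) {
    MiniCallable<String> step = new ExportStep().dependsOn("input", "/data/in");
    System.out.println(step.getOutput() + " " + step.call());
  }
}
```

The output is described separately from the logic because, per the text above, the planner asks for the output at registration time, before {@code call} has had a chance to run.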

If a given {@code PipelineCallable} does not have any dependencies, it will be executed before any jobs are run by the planner. After that, the planner will keep track of when the dependencies of a given instance have been materialized, and then execute the instance as soon as they all exist. The Crunch planner uses a thread pool executor to run multiple {@code PipelineCallable} instances simultaneously, but you can indicate that an instance should be run by itself by overriding the {@code boolean runSingleThreaded()} method to return true.
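
The scheduling rule described above can be modeled in a few lines. The following is a toy sketch using only the JDK, not the Crunch planner: {@code Step}, {@code runReady}, and the step names are hypothetical. How the real planner isolates an instance whose {@code runSingleThreaded()} returns true is not stated in the text, so this sketch simply assumes such steps run on the calling thread while the rest go to a thread pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model for illustration only; this is not the Crunch planner.
final class SchedulingSketch {
  // Hypothetical step: a name, the labels it waits for, and a run-alone flag.
  static final class Step {
    final String name;
    final boolean singleThreaded;
    final Set<String> needs;

    Step(String name, boolean singleThreaded, String... needs) {
      this.name = name;
      this.singleThreaded = singleThreaded;
      this.needs = new HashSet<>(Arrays.asList(needs));
    }
  }

  // Runs (and removes from `pending`) every step whose dependencies are all materialized.
  static List<String> runReady(List<Step> pending, Set<String> materialized,
                               ExecutorService pool) throws Exception {
    List<String> ran = new ArrayList<>();
    List<Callable<String>> parallel = new ArrayList<>();
    for (Iterator<Step> it = pending.iterator(); it.hasNext();) {
      Step step = it.next();
      if (!materialized.containsAll(step.needs)) {
        continue; // still waiting on at least one dependency
      }
      it.remove();
      if (step.singleThreaded) {
        ran.add(step.name); // "executed" alone on the calling thread
      } else {
        parallel.add(() -> step.name); // queued for the thread pool
      }
    }
    for (Future<String> done : pool.invokeAll(parallel)) {
      ran.add(done.get());
    }
    return ran;
  }

  // Two scheduling rounds: before any job has run, then after "counts" exists.
  static List<List<String>> demo() {
    List<Step> pending = new ArrayList<>(Arrays.asList(
        new Step("notify", false),
        new Step("export", false, "counts"),
        new Step("audit", true, "counts")));
    Set<String> materialized = new HashSet<>();
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      List<String> first = runReady(pending, materialized, pool);
      materialized.add("counts"); // pretend a job just wrote this output
      List<String> second = runReady(pending, materialized, pool);
      return Arrays.asList(first, second);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In the first round only the dependency-free step runs; the other two become eligible once the label they wait on has been added to the materialized set.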

The {@code call} method returns a {@code Status} to indicate whether it succeeded or failed. A failed instance, or any exceptions/errors thrown by the {@code call} method, will cause the overall Crunch pipeline containing this instance to fail.
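
A caller that honors this rule has to treat a failure status and a thrown exception or error the same way. The sketch below shows one way to express that with plain JDK types; the {@code Status} enum and {@code pipelineMayContinue} are hypothetical stand-ins defined here, not Crunch code.

```java
import java.util.concurrent.Callable;

// Illustration only; Status here is a local enum, not the Crunch type.
final class FailureSketch {
  enum Status { SUCCESS, FAILURE }

  // Returns true only when the step reports SUCCESS without throwing.
  static boolean pipelineMayContinue(Callable<Status> step) {
    try {
      return step.call() == Status.SUCCESS;
    } catch (Throwable t) {
      // Exceptions and errors are treated exactly like a FAILURE status.
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(pipelineMayContinue(() -> Status.SUCCESS));
    System.out.println(pipelineMayContinue(() -> Status.FAILURE));
    System.out.println(pipelineMayContinue(() -> {
      throw new IllegalStateException("export failed");
    }));
  }
}
```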

A number of helper methods are available as protected methods in this class so that implementations of {@code PipelineCallable} can call them from within the {@code call} method. They provide access to the dependent Target/PCollection instances that this instance needs to exist, as well as the {@code Configuration} instance for the overall Pipeline execution.

@param <Output> the output value returned by this instance (Void, PCollection, Pair<PCollection, PCollection>, etc.)
