Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each {@link InputSplit} generated by the {@link InputFormat} for the job. Mapper implementations can access the {@link Configuration} for the job via {@link JobContext#getConfiguration()}.
The framework first calls {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by {@link #map(Object,Object,Context)} for each key/value pair in the InputSplit. Finally, {@link #cleanup(Context)} is called.
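The call order described above can be sketched with plain Java. This is a minimal illustration only: LifecycleSketch and RecordingMapper are hypothetical stand-ins, not the Hadoop classes, and the records are passed in as a list rather than read from an InputSplit. Running cleanup in a finally block is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the lifecycle; this is NOT org.apache.hadoop.mapreduce.Mapper.
abstract class LifecycleSketch<K, V> {
    protected void setup() {}
    protected abstract void map(K key, V value);
    protected void cleanup() {}

    // Processes one split: setup once, map once per record, cleanup once.
    final void run(List<Map.Entry<K, V>> split) {
        setup();
        try {
            for (Map.Entry<K, V> record : split) {
                map(record.getKey(), record.getValue());
            }
        } finally {
            // Assumption of this sketch: cleanup runs even if map throws.
            cleanup();
        }
    }
}

// Records each lifecycle call so the order can be inspected.
class RecordingMapper extends LifecycleSketch<Integer, String> {
    final List<String> calls = new ArrayList<>();

    @Override protected void setup() { calls.add("setup"); }
    @Override protected void map(Integer key, String value) { calls.add("map:" + key + "=" + value); }
    @Override protected void cleanup() { calls.add("cleanup"); }
}
```

Passing a two-record split to RecordingMapper.run records one setup call, two map calls, and one cleanup call, in that order.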
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a {@link Reducer} to determine the final output. Users can control the sorting and grouping by specifying two key {@link RawComparator} classes.
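The grouping step can be illustrated without the framework. The sketch below is an assumption-laden simplification: GroupingSketch is a hypothetical helper, and a single java.util.Comparator plays both the sorting and the grouping role, whereas the framework lets the two be specified separately.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: groups intermediate values by key.
class GroupingSketch {
    // Keys that compare equal under sortOrder land in the same group;
    // groups are iterated in the order defined by sortOrder.
    static <K, V> SortedMap<K, List<V>> group(List<Map.Entry<K, V>> records,
                                              Comparator<? super K> sortOrder) {
        SortedMap<K, List<V>> grouped = new TreeMap<>(sortOrder);
        for (Map.Entry<K, V> record : records) {
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>())
                   .add(record.getValue());
        }
        return grouped;
    }
}
```

With String.CASE_INSENSITIVE_ORDER as the comparator, values emitted under "A" and "a" end up in one group, which shows how the choice of comparator changes what a reducer sees as a single key.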
The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom {@link Partitioner}.
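To make the idea of partitioning concrete, here is a minimal sketch in plain Java. PartitionSketch, HashPartitionSketch and FirstCharPartitionSketch are hypothetical names introduced for illustration; they only mirror the general shape of a partitioner (key and partition count in, partition index out) and are not the Hadoop classes.

```java
// Hypothetical stand-in interface: maps a key to a partition index in [0, numPartitions).
interface PartitionSketch<K> {
    int getPartition(K key, int numPartitions);
}

// Hash-based routing: clear the sign bit so the result is never negative, then take the modulus.
class HashPartitionSketch<K> implements PartitionSketch<K> {
    @Override
    public int getPartition(K key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Custom routing: all keys sharing the same first letter (ignoring case) go to the same partition.
class FirstCharPartitionSketch implements PartitionSketch<String> {
    @Override
    public int getPartition(String key, int numPartitions) {
        if (key.isEmpty()) {
            return 0; // arbitrary choice for this sketch
        }
        return Character.toLowerCase(key.charAt(0)) % numPartitions;
    }
}
```

The custom variant guarantees that "apple" and "Avocado" reach the same partition, which a hash of the whole key would not.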
Users can optionally specify a combiner, via {@link Job#setCombinerClass(Class)}, to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
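The effect of local aggregation can be sketched as follows. CombinerSketch is a hypothetical helper, not a Hadoop class, and it assumes the intermediate values are counts that can simply be summed. It collapses the output of one map task to a single record per distinct key before anything would be transferred.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: sums counts per key on the map side.
class CombinerSketch {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            // Add this record's count to the running total for its key.
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

Three map output records ("a", 1), ("b", 1), ("a", 1) become two records, ("a", 2) and ("b", 1). Summation suits a combiner because applying it to partial results does not change the final total.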
Applications can specify if and how the intermediate outputs are to be compressed and which {@link CompressionCodec}s are to be used via the Configuration.
If the job has zero reduces, the output of the Mapper is written directly to the {@link OutputFormat} without sorting by keys.
Example:
public class TokenCounterMapper extends Mapper
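The mapping logic that a token-counting mapper would typically perform can be sketched in plain Java. This is a framework-free illustration under stated assumptions: TokenCountSketch is a hypothetical class, the emit callback stands in for writing to the map context, and String and Integer stand in for the framework's writable types.

```java
import java.util.StringTokenizer;
import java.util.function.BiConsumer;

// Hypothetical stand-in, not the Hadoop example class: emits (token, 1) per token.
class TokenCountSketch {
    // key is unused here, as is common when the value alone carries the data (e.g. a line of text).
    static void map(Object key, String value, BiConsumer<String, Integer> emit) {
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            emit.accept(itr.nextToken(), 1);
        }
    }
}
```

One input record yields one output pair per whitespace-separated token, and an empty line yields none, which is the "zero or many output pairs" behaviour described at the top of this section.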
Applications may override the {@link #run(Context)} method to exert greater control over map processing, e.g. multi-threaded Mappers.