The abstract Reducer class is used to build reducers for the {@link Job}.
Reducers may be distributed across the cluster, but there is always only one Reducer per key.
Reducers are called in a thread-safe way, so no internal locking is required.
Because there is only one Reducer per key, mapped values need to be transmitted to one of the cluster nodes. To reduce the traffic costs between the nodes, a {@link Combiner} implementation can be added to the call; it runs alongside the mapper to pre-reduce mapped values into intermediate results (a matching Combiner sketch follows the Reducer example below).
A simple Reducer implementation could look like the following sum-function implementation:
    public class SumReducer extends Reducer<String, Integer, Integer> {
        private int sum = 0;

        public void reduce( String key, Integer value ) {
            sum += value;
        }

        public Integer finalizeReduce() {
            return sum;
        }
    }

@param <KeyIn> key type of the resulting keys
@param <ValueIn> value type of the incoming values
@param <ValueOut> value type of the reduced values
@since 3.2
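A Combiner for the same sum computation could pre-aggregate values on the mapping node before they are sent over the wire. The following is only a rough sketch: the combine and finalizeChunk callbacks are assumed here by analogy with the Reducer contract above, so check the {@link Combiner} javadoc of your version for the exact method signatures.

    import com.hazelcast.mapreduce.Combiner;

    public class SumCombiner extends Combiner<String, Integer, Integer> {
        // running sum for the current chunk of mapped values
        private int chunkSum = 0;

        // assumed per-value callback, mirroring SumReducer#reduce above
        public void combine( String key, Integer value ) {
            chunkSum += value;
        }

        // assumed to emit the intermediate result for the current chunk and reset the state
        public Integer finalizeChunk() {
            int sum = chunkSum;
            chunkSum = 0;
            return sum;
        }
    }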
Reducer implementations can access the {@link Configuration} for the job via the {@link JobContext#getConfiguration()} method.
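For instance, a job parameter can be read once in setup and reused across reduce calls; the property name my.threshold below is purely illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ThresholdSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            // Context extends JobContext, so the job Configuration is available here
            Configuration conf = context.getConfiguration();
            threshold = conf.getInt("my.threshold", 0);   // illustrative property name
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (sum >= threshold) {
                context.write(key, new IntWritable(sum));
            }
        }
    }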
Reducer has 3 primary phases:
1. Shuffle: the Reducer copies the sorted output from each {@link Mapper} using HTTP across the network.

2. Sort: the framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via {@link Job#setGroupingComparatorClass(Class)}. The sort order is controlled by {@link Job#setSortComparatorClass(Class)}.
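A minimal sketch of the grouping side, assuming a user-defined composite key class CompositeKey whose getNaturalKey() accessor returns the primary part of the key (both names are illustrative, not part of the framework); the driver would then register this class via {@link Job#setGroupingComparatorClass(Class)} and a full-key comparator via {@link Job#setSortComparatorClass(Class)}:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups records by the natural key only, so all values that share it are delivered to one
    // reduce() call even though the sort comparator ordered them by the full composite key.
    public class NaturalKeyGroupingComparator extends WritableComparator {

        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);   // CompositeKey: assumed user-defined WritableComparable
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            CompositeKey left = (CompositeKey) a;
            CompositeKey right = (CompositeKey) b;
            return left.getNaturalKey().compareTo(right.getNaturalKey());   // assumed accessor
        }
    }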
For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:

    Map Input Key: url
    Map Input Value: document
    Map Output Key: document checksum, url pagerank
    Map Output Value: url
    Partitioner: by checksum
    OutputKeyComparator: by checksum and then decreasing pagerank
    OutputValueGroupingComparator: by checksum

3. Reduce: in this phase the {@link #reduce(Object,Iterable,Context)} method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a {@link RecordWriter} via {@link Context#write(Object,Object)}.
The output of the Reducer is not re-sorted.
Example:
    public class IntSumReducer<Key> extends Reducer<Key, IntWritable, Key, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Key key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

@see Mapper
@see Partitioner
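A driver wiring this reducer into a job could look roughly like the following; MyMapper, the job name, and the input/output paths are placeholders. Because summing is associative, the same class can also be registered as the combiner:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IntSumDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "int sum");        // job name is illustrative
            job.setJarByClass(IntSumDriver.class);
            job.setMapperClass(MyMapper.class);                // placeholder mapper emitting <Text, IntWritable>
            job.setCombinerClass(IntSumReducer.class);         // pre-aggregates sums on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }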