Class AggregateBy is a {@link SubAssembly} that serves two roles for handling aggregate operations.
The first role is as a base class for composable aggregate operations that have a MapReduce Map side optimization for the Reduce side aggregation. For example, 'summing' a value within a grouping can be performed partially Map side and completed Reduce side, because summing is associative and commutative.
AggregateBy also supports operations that are not associative/commutative, like 'counting'. Counting is performed by 'counting' value occurrences Map side and summing those counts Reduce side. (Yes, counting can be transposed to summing on both the Map and Reduce sides by emitting 1's before the first sum, but that is three operations instead of two, and a hack.)
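For example, a minimal sketch of a partially aggregated sum (the field names "date", "size", and "total-size" here are illustrative, not part of the API):
<pre>{@code
// sums "size" per "date"; the sum is computed partially Map side
// and completed Reduce side
Pipe assembly = new Pipe( "sums" );
assembly = new SumBy( assembly, new Fields( "date" ), new Fields( "size" ), new Fields( "total-size" ), long.class );
}</pre>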
Think of this mechanism as a MapReduce Combiner, but more efficient: no values are serialized, deserialized, saved to disk, or multi-pass sorted in the process, trading a modest amount of memory for reduced CPU and little or no IO.
Further, Combiners are limited to associative/commutative operations.
Additionally, the Cascading planner can move the Map side optimization into the previous Reduce operation, further increasing IO performance (the IO between the preceding Reduce and the current Map phase travels over HDFS).
The second role of the AggregateBy class is to allow for composition of AggregateBy sub-classes. That is, {@link SumBy} and {@link CountBy} AggregateBy sub-classes can be performed in parallel on the same grouping keys. Custom AggregateBy classes can be created by sub-classing this class and implementing a special {@link Functor} for use on the Map side. Multiple Functor instances are managed by the {@link CompositeFunction} class, allowing them all to share the same LRU value map for better efficiency.
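For example, a sketch composing two sub-assemblies over the same grouping key (field names are illustrative):
<pre>{@code
// groups on "date" once and computes both aggregates in a single Reduce
Pipe assembly = new Pipe( "aggregates" );
SumBy sumBy = new SumBy( new Fields( "size" ), new Fields( "total-size" ), long.class );
CountBy countBy = new CountBy( new Fields( "num-events" ) );
assembly = new AggregateBy( assembly, new Fields( "date" ), sumBy, countBy );
}</pre>
And a sketch of a custom Map side Functor, here a hypothetical running maximum that is not part of the library, assuming a single numeric argument field:
<pre>{@code
// keeps a running maximum per grouping key Map side; the partial results
// it emits would be completed Reduce side by a corresponding Aggregator
public class MaxFunctor implements AggregateBy.Functor {
  @Override
  public Fields getDeclaredFields() {
    return new Fields( "max" );
  }

  @Override
  public Tuple aggregate( FlowProcess flowProcess, TupleEntry args, Tuple context ) {
    double value = args.getDouble( 0 );

    if( context == null )
      return new Tuple( value );

    if( value > context.getDouble( 0 ) )
      context.set( 0, value );

    return context;
  }

  @Override
  public Tuple complete( FlowProcess flowProcess, Tuple context ) {
    return context; // the partial maximum is emitted as-is
  }
}
}</pre>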
AggregateBy instances return {@code argumentFields}, which are used internally to control the values passed to internal Functor instances. If any of the argumentFields also have {@link java.util.Comparator}s, they will be used for secondary sorting (see {@link GroupBy} {@code sortFields}). This feature is used by {@link FirstBy} to control which Tuple is seen first for a grouping.
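For example, a sketch attaching a reverse-ordering {@link java.util.Comparator} so FirstBy sees the largest value first (field names are illustrative):
<pre>{@code
// the comparator on "size" secondary-sorts each grouping, so the tuple
// with the largest "size" is seen first per "date"
Fields firstFields = new Fields( "size" );
firstFields.setComparator( "size", Collections.reverseOrder() );

Pipe assembly = new Pipe( "first" );
assembly = new FirstBy( assembly, new Fields( "date" ), firstFields );
}</pre>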
To tune the LRU, set the {@code threshold} value high enough to utilize available memory, or set a default value via the {@link #AGGREGATE_BY_THRESHOLD} property. The current default ({@link CompositeFunction#DEFAULT_THRESHOLD}) is {@code 10,000} unique keys. Note that "flushes" from the LRU will be logged in threshold increments along with memory information.
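For example, a sketch setting a default threshold through the properties handed to the flow connector (the chosen value here is illustrative):
<pre>{@code
// raise the default LRU threshold for all AggregateBy instances in the
// flow; the value is the number of unique grouping keys held in memory
Properties properties = new Properties();
properties.setProperty( AggregateBy.AGGREGATE_BY_THRESHOLD, "100000" );

FlowConnector flowConnector = new HadoopFlowConnector( properties );
}</pre>
A threshold may also be passed directly to the AggregateBy constructors that accept one, overriding the property for that instance.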
Note that using an AggregateBy instance automatically inserts a {@link GroupBy} into the resulting {@link cascading.flow.Flow}, and passing multiple AggregateBy instances to a parent AggregateBy instance still results in a single GroupBy.
Also note that {@link Unique} is not a CompositeAggregator and is slightly more optimized internally.
As of Cascading 2.6, AggregateBy honors the {@link cascading.tuple.Hasher} interface for storing keys in the cache.
@see SumBy
@see CountBy
@see Unique