Class PartitionTap can be used to write tuple streams out to files and sub-directories based on the values in the current {@link cascading.tuple.Tuple} instance.
The constructor takes a {@link cascading.tap.hadoop.Hfs} {@link cascading.tap.Tap} and a {@link Partition} implementation. This allows Tuple values at given positions to be used as directory names during write operations, and directory names to be used as data during read operations.
The key value here is that there is no need to duplicate data values in the directory names and inside the data files.
Only values declared in the parent Tap are read from or written to the underlying file system files, while fields declared by the {@link Partition} are read from or written to the directory names only. That is, the PartitionTap instance will sink or source the partition fields plus the parent Tap fields. The partition fields and parent Tap fields do not need to have common field names.
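The split between directory names and file contents can be pictured with a minimal, self-contained sketch. It does not use the Cascading API: the class {@code PartitionSplitSketch}, its methods, and the field names ({@code year}, {@code month}, {@code user}, {@code count}) are hypothetical and chosen only for illustration, and the "/" and tab delimiters are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this class is not part of Cascading.
final class PartitionSplitSketch {

    /** Joins the values of the partition fields into a relative directory path. */
    static String toPartitionPath(Map<String, String> tuple, List<String> partitionFields) {
        List<String> parts = new ArrayList<>();
        for (String field : partitionFields) {
            parts.add(tuple.get(field));
        }
        return String.join("/", parts); // "/" delimiter is an assumption
    }

    /** Joins the values of the parent Tap fields into the line stored in the data file. */
    static String toDataLine(Map<String, String> tuple, List<String> parentFields) {
        List<String> parts = new ArrayList<>();
        for (String field : parentFields) {
            parts.add(tuple.get(field));
        }
        return String.join("\t", parts); // tab delimiter is an assumption
    }

    public static void main(String[] args) {
        Map<String, String> tuple = new LinkedHashMap<>();
        tuple.put("year", "2012");
        tuple.put("month", "June");
        tuple.put("user", "alice");
        tuple.put("count", "42");

        // Partition fields appear only in the directory name.
        System.out.println(toPartitionPath(tuple, Arrays.asList("year", "month"))); // 2012/June
        // Parent Tap fields appear only inside the data file.
        System.out.println(toDataLine(tuple, Arrays.asList("user", "count")));
    }
}
```

In this sketch the two field lists share no names, which mirrors the point that partition fields and parent Tap fields need not overlap.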
Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files.
{@code openWritesThreshold} limits the number of files that may be open for output at any one time. This value defaults to 300 files. Each time the threshold is exceeded, 10% of the least recently used open files will be closed.
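The threshold behavior can be approximated with an access-ordered {@code java.util.LinkedHashMap}. The sketch below is not the PartitionTap implementation: the class {@code OpenWritesSketch} and its methods are hypothetical, in-memory buffers stand in for open files, and computing the number of files to close as 10% of the threshold (with a minimum of one) is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this class is not part of Cascading.
final class OpenWritesSketch {
    private final int threshold;
    // accessOrder = true keeps entries ordered from least to most recently used.
    private final LinkedHashMap<String, StringBuilder> open =
        new LinkedHashMap<>(16, 0.75f, true);
    private final List<String> closed = new ArrayList<>();

    OpenWritesSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Appends a line to the "file" for the given partition path, opening it if needed. */
    void write(String path, String line) {
        StringBuilder writer = open.get(path); // get() marks the entry as recently used
        if (writer == null) {
            writer = new StringBuilder();
            open.put(path, writer);
        }
        writer.append(line).append('\n');
        if (open.size() > threshold) {
            purge();
        }
    }

    /** Closes roughly 10% of the threshold, starting with the least recently used. */
    private void purge() {
        int toClose = Math.max(1, threshold / 10); // assumption: 10% of the threshold
        Iterator<Map.Entry<String, StringBuilder>> it = open.entrySet().iterator();
        while (toClose-- > 0 && it.hasNext()) {
            closed.add(it.next().getKey());
            it.remove();
        }
    }

    int openCount() {
        return open.size();
    }

    List<String> closedPaths() {
        return closed;
    }

    public static void main(String[] args) {
        OpenWritesSketch sink = new OpenWritesSketch(10);
        for (int i = 0; i <= 10; i++) {
            sink.write("2012/" + i, "row");
        }
        // The eleventh distinct path exceeds the threshold and triggers a purge.
        System.out.println(sink.openCount() + " open, closed: " + sink.closedPaths());
    }
}
```

A path that is written to again after being closed would simply be reopened in this sketch, which is why a threshold far below the number of distinct partitions tends to cause repeated open and close cycles.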
PartitionTap will populate a given {@code partition} without regard to the case of the values being used. Thus the resulting paths {@code 2012/June/} and {@code 2012/june/} will likely result in two open files writing into the same location. Forcing the case to be consistent with a custom Partition implementation or an upstream {@link cascading.operation.Function} is recommended, see {@link cascading.operation.expression.ExpressionFunction}.
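The effect of normalizing case before the partition path is built can be shown with a small, self-contained sketch. It does not use the Cascading API: the class {@code PartitionCaseSketch} and its methods are hypothetical, and lower-casing with {@code Locale.ROOT} is just one possible normalization.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: this class is not part of Cascading.
final class PartitionCaseSketch {

    /** Forces a partition value to a single, locale-independent case. */
    static String normalize(String partitionValue) {
        return partitionValue.toLowerCase(Locale.ROOT);
    }

    /** Builds the set of distinct directory paths for the given partition values. */
    static Set<String> distinctPaths(String[][] partitions, boolean normalizeCase) {
        Set<String> paths = new LinkedHashSet<>();
        for (String[] values : partitions) {
            StringBuilder path = new StringBuilder();
            for (String value : values) {
                path.append(normalizeCase ? normalize(value) : value).append('/');
            }
            paths.add(path.toString());
        }
        return paths;
    }

    public static void main(String[] args) {
        String[][] partitions = { { "2012", "June" }, { "2012", "june" } };
        // Without normalization, two distinct paths (and two open files) result.
        System.out.println(distinctPaths(partitions, false));
        // With normalization, both tuples map to one path.
        System.out.println(distinctPaths(partitions, true));
    }
}
```

In a real flow the same normalization would be applied either inside a custom Partition implementation or by an upstream Function, as recommended above.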
Though Hadoop has no mechanism to prevent simultaneous writes to a directory from multiple jobs, that does not mean it is safe to do so. The same is true of the PartitionTap. Interleaving writes to a common parent (root) directory across multiple flows will very likely lead to data loss.