The abstract superclass of all Pipes, which transform one data type to another. Pipes are most often used for feature extraction.
Although Pipe does not have any "abstract methods", in order to use a Pipe subclass you must override either the {@link #pipe} method or the {@link #newIteratorFrom} method. The former is appropriate when the pipe's processing of an Instance is strictly one-to-one: for every Instance coming in, there is exactly one Instance coming out. The latter is appropriate when the pipe's processing may result in more or fewer Instances than arrive through its source iterator.
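The difference between the two override points can be sketched without the library. The classes below (SketchInstance, SketchPipe, LowercasePipe, DropEmptyPipe, PipeOverrideDemo) are hypothetical stand-ins written for illustration only. They borrow the method names pipe and newIteratorFrom from the text above, but they are not the MALLET classes and their behavior is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an instance: only a mutable "data" field.
class SketchInstance {
    Object data;
    SketchInstance(Object data) { this.data = data; }
}

// Hypothetical stand-in for the Pipe superclass (not the MALLET class).
class SketchPipe {
    // One-to-one hook: by default the instance passes through unchanged.
    SketchInstance pipe(SketchInstance carrier) { return carrier; }

    // Stream hook: by default, apply pipe() to each instance from the source.
    Iterator<SketchInstance> newIteratorFrom(final Iterator<SketchInstance> source) {
        return new Iterator<SketchInstance>() {
            public boolean hasNext() { return source.hasNext(); }
            public SketchInstance next() { return pipe(source.next()); }
        };
    }
}

// Strictly one instance out per instance in: override pipe().
class LowercasePipe extends SketchPipe {
    @Override
    SketchInstance pipe(SketchInstance carrier) {
        carrier.data = carrier.data.toString().toLowerCase();
        return carrier;
    }
}

// Possibly fewer instances out than in: override newIteratorFrom().
// Buffers eagerly for brevity; a lazy filter would also work.
class DropEmptyPipe extends SketchPipe {
    @Override
    Iterator<SketchInstance> newIteratorFrom(Iterator<SketchInstance> source) {
        List<SketchInstance> kept = new ArrayList<>();
        while (source.hasNext()) {
            SketchInstance inst = source.next();
            if (!inst.data.toString().isEmpty()) kept.add(inst);
        }
        return kept.iterator();
    }
}

class PipeOverrideDemo {
    // Wraps each string in an instance, runs the pipe, and collects the data fields.
    static List<String> run(SketchPipe p, String... inputs) {
        List<SketchInstance> in = new ArrayList<>();
        for (String s : inputs) in.add(new SketchInstance(s));
        List<String> out = new ArrayList<>();
        Iterator<SketchInstance> it = p.newIteratorFrom(in.iterator());
        while (it.hasNext()) out.add(it.next().data.toString());
        return out;
    }
}
```

In this sketch, LowercasePipe never needs to see the iterator, while DropEmptyPipe has to, because it changes how many instances come out.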
A pipe operates on an {@link cc.mallet.types.Instance}, which is a carrier of data. A pipe reads from and writes to fields in the Instance when it is requested to process the instance. It is up to the pipe which fields in the Instance it reads from and writes to, but usually a pipe will read its input from and write its output to the "data" field of an instance.
A pipe doesn't have any direct notion of input or output; it merely modifies instances that are handed to it. A set of helper classes, which implement the interface {@link Iterator}, iterate over commonly encountered input data structures and feed the elements of these data structures to a pipe as instances. A pipe is frequently used in conjunction with an {@link cc.mallet.types.InstanceList}. As instances are added to the list, they are processed by the pipe associated with the instance list, and the processed Instance is kept in the list.
In one common usage, a {@link cc.mallet.pipe.iterator.FileIterator} is given a list of directories to operate over. The FileIterator walks through each directory, creating an instance for each file and putting the data from the file in the data field of the instance. The directory of the file is stored in the target field of the instance. The FileIterator feeds instances to an InstanceList, which processes the instances through its associated pipe and keeps the results.
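The flow from iterator to instance list can be sketched as follows. Everything here is a hypothetical stand-in (FileLikeInstance, SketchInstanceList, its addAllThrough method, and InstanceListDemo are names invented for this sketch), and the "files" are in-memory strings rather than a real directory walk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an instance with "data" and "target" fields.
class FileLikeInstance {
    Object data;    // file contents on the way in, processed form on the way out
    Object target;  // directory name, used here as the label
    FileLikeInstance(Object data, Object target) {
        this.data = data;
        this.target = target;
    }
}

// Hypothetical stand-in for an instance list with an associated pipe.
class SketchInstanceList {
    private final UnaryOperator<FileLikeInstance> pipe;
    private final List<FileLikeInstance> instances = new ArrayList<>();

    SketchInstanceList(UnaryOperator<FileLikeInstance> pipe) { this.pipe = pipe; }

    // Pulls each raw instance from the source, processes it, and keeps the result.
    void addAllThrough(Iterator<FileLikeInstance> source) {
        while (source.hasNext()) instances.add(pipe.apply(source.next()));
    }

    int size() { return instances.size(); }
    FileLikeInstance get(int i) { return instances.get(i); }
}

class InstanceListDemo {
    static SketchInstanceList demo() {
        // Stands in for what a directory-walking iterator would produce.
        List<FileLikeInstance> raw = new ArrayList<>();
        raw.add(new FileLikeInstance("Hello World", "greetings"));
        raw.add(new FileLikeInstance("Dear Sir", "letters"));

        // The pipe reads the data field and writes its output back to it.
        SketchInstanceList list = new SketchInstanceList(inst -> {
            inst.data = inst.data.toString().toLowerCase().split("\\s+");
            return inst;
        });
        list.addAllThrough(raw.iterator());
        return list;
    }
}
```

Note that the pipe in this sketch never touches the target field; which fields a pipe reads and writes is its own decision.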
Pipes can be hierarchically composed. In a typical usage, a SerialPipes is created, which holds other pipes in an ordered list. Piping an instance through a SerialPipes means piping the instance through each of the child pipes in sequence.
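A minimal sketch of serial composition, assuming each child step can be modeled as a function on the data field. SerialSketch and SerialSketchDemo are hypothetical names invented here, not library classes. Because the composite is itself a step, composites can be nested.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a serial composite: applies its children in order.
class SerialSketch implements UnaryOperator<Object> {
    private final List<UnaryOperator<Object>> children;

    @SafeVarargs
    SerialSketch(UnaryOperator<Object>... children) {
        this.children = Arrays.asList(children);
    }

    @Override
    public Object apply(Object data) {
        Object current = data;
        // The output of each child becomes the input of the next.
        for (UnaryOperator<Object> child : children) current = child.apply(current);
        return current;
    }
}

class SerialSketchDemo {
    static Object demo() {
        UnaryOperator<Object> lower = d -> d.toString().toLowerCase();
        UnaryOperator<Object> tokenize = d -> Arrays.asList(d.toString().split("\\s+"));
        UnaryOperator<Object> count = d -> ((List<?>) d).size();

        // Nested composition: the inner composite is a child of the outer one.
        SerialSketch text = new SerialSketch(lower, tokenize);
        SerialSketch all = new SerialSketch(text, count);
        return all.apply("The quick brown Fox"); // four tokens, so returns 4
    }
}
```

Order matters in such a chain: swapping count and tokenize in this sketch would fail, since count expects the list that tokenize produces.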
A pipe holds two separate Alphabets: one for the symbols (feature names) encountered in the data fields of the instances processed through the pipe, and one for the symbols (e.g. class labels) encountered in the target fields.
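The role of the two alphabets can be illustrated with a small sketch. SketchAlphabet and TwoAlphabetsDemo are hypothetical classes written for this example, and the assumption that an alphabet assigns consecutive integer indices to symbols in order of first appearance is a simplification made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet: a two-way symbol/index mapping.
class SketchAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> symbols = new ArrayList<>();

    // Returns the index of the symbol, assigning the next free index if it is new.
    int lookupIndex(String symbol) {
        Integer idx = indexOf.get(symbol);
        if (idx == null) {
            idx = symbols.size();
            indexOf.put(symbol, idx);
            symbols.add(symbol);
        }
        return idx;
    }

    String lookupSymbol(int index) { return symbols.get(index); }
    int size() { return symbols.size(); }
}

class TwoAlphabetsDemo {
    // Kept separate so that feature indices and label indices never collide.
    final SketchAlphabet dataAlphabet = new SketchAlphabet();
    final SketchAlphabet targetAlphabet = new SketchAlphabet();

    // Encodes one labeled document: feature indices first, label index last.
    int[] encode(String[] tokens, String label) {
        int[] out = new int[tokens.length + 1];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = dataAlphabet.lookupIndex(tokens[i]);
        }
        out[tokens.length] = targetAlphabet.lookupIndex(label);
        return out;
    }
}
```

Under these assumptions, encoding the tokens "spam", "offer", "spam" with the label "junk" on a fresh TwoAlphabetsDemo gives the indices 0, 1, 0 for the features and 0 for the label; the two zeros refer to different symbols because they come from different alphabets.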
@author Andrew McCallum mccallum@cs.umass.edu