Disk-based implementation of ReadEndsForMarkDuplicatesMap. A subdirectory of the system tmpdir is created to store files, one for each reference sequence. The reference sequence that is currently being queried (i.e. the sequence for which remove() has been most recently called) is stored in RAM. ReadEnds for all other sequences are stored on disk.
When put() is called for the sequence currently held in RAM, the ReadEnds object is simply put into the in-memory map. If put() is called for any other sequence, the ReadEnds object is appended to the file for that sequence, creating the file if necessary.
When remove() is called for the sequence currently held in RAM, remove() is called on the in-memory map. If remove() is called for any other sequence, the current RAM sequence is written to disk, the new sequence is read from disk into the RAM map, and the file for the new sequence is deleted.
If everything works as intended and reads are processed in genomic order, records will be written to disk for mates that are in a later sequence. When the mate is reached in the input SAM file, the file that was written will be deleted. As a result, all temporary files should have been deleted by the time all the reads are processed. The temp directory is marked to be deleted on exit, so everything should get cleaned up.
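The put()/remove() behavior described above can be illustrated with a minimal sketch. It is not the actual implementation: the class name SpillingSequenceMap, the String keys and values (standing in for ReadEnds), the tab-separated file format, and the filesOnDisk() helper are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

/** Hypothetical, simplified stand-in: values are Strings rather than ReadEnds. */
class SpillingSequenceMap {
    private final Path dir;
    private final Map<String, String> ramMap = new HashMap<>();
    private int ramSequence = -1; // assumption: -1 means no sequence is in RAM yet

    SpillingSequenceMap() {
        try {
            dir = Files.createTempDirectory("spill");
            // File.deleteOnExit() removes a directory only if it is empty at JVM exit.
            dir.toFile().deleteOnExit();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private Path fileFor(int sequence) {
        return dir.resolve(sequence + ".tsv");
    }

    /** In-memory put for the RAM sequence; otherwise append to that sequence's file. */
    void put(int sequence, String key, String value) {
        if (sequence == ramSequence) {
            ramMap.put(key, value);
        } else {
            append(fileFor(sequence), Collections.singletonList(key + "\t" + value));
        }
    }

    /** Makes the requested sequence the RAM sequence if needed, then removes the key. */
    String remove(int sequence, String key) {
        if (sequence != ramSequence) {
            switchTo(sequence);
        }
        return ramMap.remove(key);
    }

    private void switchTo(int sequence) {
        try {
            // Write whatever is left of the current RAM sequence to its file.
            if (!ramMap.isEmpty()) {
                List<String> lines = new ArrayList<>();
                ramMap.forEach((k, v) -> lines.add(k + "\t" + v));
                append(fileFor(ramSequence), lines);
                ramMap.clear();
            }
            // Load the new sequence from disk, then delete its file.
            Path file = fileFor(sequence);
            if (Files.exists(file)) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String[] parts = line.split("\t", 2); // assumes keys contain no tab
                    ramMap.put(parts[0], parts[1]);
                }
                Files.delete(file);
            }
            ramSequence = sequence;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void append(Path file, List<String> lines) {
        try {
            Files.write(file, lines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of per-sequence files currently on disk (helper for illustration). */
    int filesOnDisk() {
        try (Stream<Path> entries = Files.list(dir)) {
            return (int) entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a put() for sequence 2 while sequence 1 is in RAM creates a file for sequence 2. The first remove() for sequence 2 then writes any remaining sequence 1 entries to disk, loads sequence 2 into RAM, and deletes the sequence 2 file.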
@author alecw@broadinstitute.org