and
Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate Medians and other Quantiles in One Pass and with Limited Memory, Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data.
The broad picture is as follows. Two concepts are used: shrinking and sampling. Shrinking takes a data sequence, sorts it, and produces a shrunken data sequence by keeping every k-th element and discarding the rest. The shrunken data sequence is an approximation of the original data sequence.
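The shrinking step described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name `shrink` is hypothetical, and the choice to keep the k-th, 2k-th, ... elements (rather than some other offset within each group of k) is an assumption.

```python
def shrink(values, k):
    """Sort `values` and keep every k-th element (the k-th, 2k-th, ...).

    Hypothetical helper for illustration; the offset within each group
    of k elements is an assumption, not taken from the papers.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    ordered = sorted(values)
    # Slice starting at index k-1 with stride k: one survivor per k inputs.
    return ordered[k - 1::k]


if __name__ == "__main__":
    print(shrink([5, 1, 4, 2, 3, 6], 2))
```

Each surviving element stands in for k elements of the input, which is why the result can still be queried for approximate quantiles.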
Imagine a large data sequence (residing on disk or being generated in memory on the fly) and a main memory block of n=b*k elements (b is the number of buffers, k is the number of elements per buffer). Fill elements from the data sequence into the block until it is full or the data sequence is exhausted. When the block (or a subset of its buffers) is full and the data sequence is not exhausted, apply shrinking to lossily compress several buffers into a single buffer. Repeat these steps until all elements of the data sequence have been consumed. The block is now a shrunken approximation of the original data sequence. Treating it as if it were the original data sequence, we can determine quantiles in main memory.
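The fill-and-shrink loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the cited papers: the names `collapse` and `approximate_quantile` are hypothetical, each buffer carries a weight (the number of original elements each stored element represents), and the policy "collapse all buffers as soon as all b are full" is the simplest possible choice. The papers use more refined policies and offsets.

```python
def collapse(buffers, k):
    """Merge weighted buffers into one buffer of k elements.

    `buffers` is a list of (weight, elements) pairs, each with k elements.
    Conceptually every element is repeated `weight` times, the result is
    sorted, and k evenly spaced elements are kept.
    """
    weighted = sorted((x, w) for w, buf in buffers for x in buf)
    total_weight = sum(w for w, _ in buffers)
    kept = []
    running = 0                          # position reached in the expanded sequence
    next_pos = (total_weight + 1) // 2   # first 1-based position to keep (assumed offset)
    for x, w in weighted:
        running += w
        while len(kept) < k and running >= next_pos:
            kept.append(x)
            next_pos += total_weight
    return (total_weight, kept)


def approximate_quantile(stream, b, k, phi):
    """One-pass estimate of the phi-quantile using at most b*k stored elements."""
    if b < 2 or k < 1:
        raise ValueError("need b >= 2 and k >= 1")
    buffers = []   # full buffers as (weight, elements)
    current = []   # buffer currently being filled
    for x in stream:
        current.append(x)
        if len(current) == k:
            buffers.append((1, current))
            current = []
            if len(buffers) >= b:
                # Block is full: lossily compress all buffers into one.
                buffers = [collapse(buffers, k)]
    if current:
        buffers.append((1, current))     # trailing, partially filled buffer
    if not buffers:
        raise ValueError("empty stream")
    # Treat the block as if it were the original data, honoring weights.
    weighted = sorted((x, w) for w, buf in buffers for x in buf)
    target = phi * sum(w for _, w in weighted)
    running = 0
    for x, w in weighted:
        running += w
        if running >= target:
            return x
    return weighted[-1][0]
```

For example, `approximate_quantile(range(1, 1001), b=4, k=50, phi=0.5)` processes 1000 elements while never holding more than 200 of them, and returns a value near the true median.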
The whole thing now boils down to one question: can we choose b and k (the number of buffers and the buffer size) such that b*k is minimized, yet quantiles determined from the block are guaranteed to deviate from the true quantiles by no more than some epsilon? It turns out that we can. It also turns out that the required main memory block size n=b*k is usually moderate (see the table above).
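To make the guarantee concrete: in the cited papers, an element counts as an epsilon-approximate phi-quantile of N elements if its rank lies between (phi - epsilon)*N and (phi + epsilon)*N. The sketch below checks that condition against the full data. The function name is hypothetical, and rounding of the rank bounds is deliberately omitted for simplicity.

```python
import bisect


def is_epsilon_approximate(data, candidate, phi, epsilon):
    """Return True if `candidate` is an epsilon-approximate phi-quantile of `data`.

    Hypothetical checker for illustration. Needs the full data set, so it is
    only useful for validating an approximation, not for computing one.
    """
    ordered = sorted(data)
    n = len(ordered)
    # 1-based ranks occupied by the candidate (a range if there are duplicates).
    lowest = bisect.bisect_left(ordered, candidate) + 1
    highest = bisect.bisect_right(ordered, candidate)
    if highest < lowest:
        return False  # candidate does not occur in the data
    low_rank = (phi - epsilon) * n
    high_rank = (phi + epsilon) * n
    # Accept if any rank of the candidate falls inside the allowed window.
    return lowest <= high_rank and highest >= low_rank
```

With data 1..100, phi=0.5 and epsilon=0.1, any element with rank between 40 and 60 qualifies as an approximate median, so 55 passes and 70 does not.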
The theme can be combined with random sampling to further reduce main memory requirements, at the expense of probabilistic guarantees. Sampling filters the data sequence and feeds only selected elements to the algorithm outlined above. Sampling is turned on or off, depending on the parametrization.
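One simple way to picture the sampling filter is to pass on one uniformly chosen element out of every block of r consecutive elements. The sketch below does that as a generator. The name `sample_one_per_block`, the fixed rate r, and the handling of a trailing partial block are all assumptions for illustration; the papers derive the sampling rate from the requested epsilon and delta.

```python
import random


def sample_one_per_block(stream, r, rng=None):
    """Yield one uniformly chosen element from each block of r consecutive elements.

    Hypothetical filter for illustration. With r == 1 every element passes
    through unchanged, i.e. sampling is effectively turned off.
    """
    if r < 1:
        raise ValueError("r must be >= 1")
    rng = rng or random.Random()
    chosen = None
    seen = 0  # elements seen so far in the current block
    for x in stream:
        seen += 1
        # Reservoir sampling of size 1: replace with probability 1/seen.
        if rng.randrange(seen) == 0:
            chosen = x
        if seen == r:
            yield chosen
            seen = 0
    if seen:
        # Assumption: a trailing partial block still contributes one element.
        yield chosen
```

The output of this generator would be fed to the buffer-filling stage in place of the raw data sequence, so the block only ever sees about 1/r of the elements.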
This quick overview does not go into important details, such as assigning proper weights to buffers, how to choose subsets of buffers to shrink, etc. For more information consult the papers cited above.
Time Performance:
Quantiles | Epsilon | Delta | Filling [#elements/sec], N unknown (Nmax=inf) | Filling [#elements/sec], N known (Nmax=10^7) | Quantile computation [#quantiles/sec], N unknown (Nmax=inf) | Quantile computation [#quantiles/sec], N known (Nmax=10^7)
10^4 | 10^-1 | 10^-1 | 1600000 | 1300000 | 250000 | 130000
 | 10^-2 | | 360000 | 1200000 | 50000 | 20000
 | 10^-3 | | 150000 | 200000 | 3600 | 3000
 | 10^-4 | | 120000 | 170000 | 80 | 1000