This is an implementation of the AbstractTracker. It is designed to work with the WUI and to perform various logging activities.
At the end of each snapshot a line is written to the 'progress-statistics.log' file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.
discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.
KB/s(avg) is the bandwidth usage. The total number of bytes downloaded is used to calculate the average bandwidth usage (KB/sec) over the whole crawl. Since this total is also noted each time a snapshot is made, the average bandwidth usage during the last snapshot period can be calculated to give a "current" rate. The first number is the current rate and the average is in parentheses.
doc/s(avg) works the same way as KB/s(avg), except that it shows the number of documents (URIs) rather than KB downloaded.
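The current and average figures described for the KB/s(avg) and doc/s(avg) columns can be sketched as below. This is only an illustration under stated assumptions: the class RateSketch, its method names, and the two-decimal output format are hypothetical and are not taken from the tracker itself.

```java
import java.util.Locale;

// Hypothetical helper, not part of the tracker's actual API.
final class RateSketch {

    /** Average rate since the crawl started: running total over total elapsed seconds. */
    static double averageRate(long total, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0; // avoid division by zero before any time has passed
        }
        return total / (elapsedMillis / 1000.0);
    }

    /** "Current" rate: growth of the total since the previous snapshot over the snapshot period. */
    static double currentRate(long total, long previousTotal, long periodMillis) {
        if (periodMillis <= 0) {
            return 0.0;
        }
        return (total - previousTotal) / (periodMillis / 1000.0);
    }

    /** Converts a bytes-per-second rate to KB per second. */
    static double toKb(double bytesPerSec) {
        return bytesPerSec / 1024.0;
    }

    /** Renders a column as current(average); the two-decimal format is an assumption. */
    static String column(double current, double average) {
        return String.format(Locale.ROOT, "%.2f(%.2f)", current, average);
    }
}
```

For doc/s(avg) the same two methods would be fed the downloaded document count instead of the byte total, without the KB conversion.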
busy-thread is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.
Finally, mem-use-KB is extracted from the runtime environment (Runtime.getRuntime().totalMemory()).
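As a minimal sketch of the mem-use-KB value: Runtime.totalMemory() reports, in bytes, the memory currently allocated to the JVM, which is not the same as the memory in use. The class and method names below are hypothetical, and the division by 1024 is an assumption based on the column name.

```java
// Hypothetical helper, not part of the tracker's actual API.
final class MemUseSketch {

    /** Memory currently allocated to the JVM, in KB (assumes 1 KB = 1024 bytes). */
    static long totalMemoryKb() {
        return Runtime.getRuntime().totalMemory() / 1024;
    }

    /** Memory in use, in KB: allocated memory minus the free part of that allocation. */
    static long usedMemoryKb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024;
    }
}
```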
In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.
- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Disposition of all seeds (this is written to 'reports.log' at end of crawl)
- Successfully downloaded documents per host per source
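The per-key tallies listed above (per status code, mime type or host) could be kept in simple maps, as in the sketch below. It assumes one counter map for documents and one for bytes; the class TallySketch and its methods are hypothetical and do not reflect how the tracker actually stores this data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the tracker's actual API.
final class TallySketch {
    private final Map<String, Long> docsPerKey = new HashMap<>();
    private final Map<String, Long> bytesPerKey = new HashMap<>();

    /** Records one successfully downloaded document of the given size under a key. */
    void record(String key, long bytes) {
        docsPerKey.merge(key, 1L, Long::sum);     // document count for this key
        bytesPerKey.merge(key, bytes, Long::sum); // amount of data for this key
    }

    /** Number of documents recorded under the key, or 0 if none. */
    long docs(String key) {
        return docsPerKey.getOrDefault(key, 0L);
    }

    /** Total bytes recorded under the key, or 0 if none. */
    long bytes(String key) {
        return bytesPerKey.getOrDefault(key, 0L);
    }
}
```

One such tally per dimension (for example one keyed by mime type and one keyed by host) would cover both the document counts and the data amounts.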
@contributor Parker Thompson
@contributor Kristinn Sigurdsson
@contributor gojomo