* can use the second parameter to the CrawlController constructor.
*
* Note: if you enable the resuming feature and want to start a fresh
* crawl, you need to delete the contents of rootFolder manually.
*/
CrawlController controller = new CrawlController(rootFolder);
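/*
 * For example, to enable resumable crawling you could pass true as the
 * second argument. (This assumes the second parameter is a boolean
 * "resumable" flag; check the CrawlController constructors in your
 * crawler4j version.)
 *
 * CrawlController controller = new CrawlController(rootFolder, true);
 */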
/*
* For each crawl, you need to add some seed URLs.
* These are the first URLs to be fetched; the
* crawler then follows the links found in these
* pages.
*/
controller.addSeed("http://www.ics.uci.edu/~yganjisa/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Be polite:
* Make sure that we don't send more than 5 requests per
* second (200 milliseconds between requests).
*/
controller.setPolitenessDelay(200);
/*
* Optional:
* You can set the maximum crawl depth here.
* The default value is -1 (unlimited depth).
*/
controller.setMaximumCrawlDepth(2);
/*
* Optional:
* You can set the maximum number of pages to crawl.
* The default value is -1 (unlimited number of pages).
*/
controller.setMaximumPagesToFetch(500);
/*
* Do you need to set a proxy?
* If so, you can use:
* controller.setProxy("proxyserver.example.com", 8080);
* OR
* controller.setProxy("proxyserver.example.com", 8080, username, password);
*/
/*
* Note: you can configure several other parameters by modifying
* the crawler4j.properties file.
*/
/*
* Start the crawl. This is a blocking operation, meaning
* that your code will only reach the line after this call
* once crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}