java-web-crawler

Auto-restart after one cycle finishes
Configuration to set the time between two cycles
Capability to resume crawling from the saved state after a JVM or server crash or shutdown, continuing from where it left off
Configuration to run crawler processes for different domains
Configuration to set a different set of URL filters per domain
Configuration to set different parsers per domain
Configuration to enable or disable robots.txt rules
Configuration to set the maximum number of URL visits per second
Configuration to set maximum depth to visit
Configuration to set maximum bytes per page to download
Sitemap parsing support
Retry support with parsing
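
Several of the options listed above (robots.txt on/off, maximum URL visits per second, maximum depth, maximum bytes per page, and per-domain settings) could be modelled roughly as follows. This is only a minimal sketch: the class, field, and method names are hypothetical and are not taken from this project, whose actual configuration format is not described here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from java-web-crawler itself.
final class CrawlConfigSketch {

    /** Per-domain limits; the field names are illustrative assumptions. */
    static final class DomainConfig {
        final boolean robotsTxtEnabled;
        final int maxVisitsPerSecond;
        final int maxDepth;
        final int maxBytesPerPage;

        DomainConfig(boolean robotsTxtEnabled, int maxVisitsPerSecond,
                     int maxDepth, int maxBytesPerPage) {
            if (maxVisitsPerSecond <= 0) {
                throw new IllegalArgumentException("maxVisitsPerSecond must be positive");
            }
            this.robotsTxtEnabled = robotsTxtEnabled;
            this.maxVisitsPerSecond = maxVisitsPerSecond;
            this.maxDepth = maxDepth;
            this.maxBytesPerPage = maxBytesPerPage;
        }

        /** Minimum pause between two requests that respects the per-second limit. */
        long minDelayMillis() {
            return 1000L / maxVisitsPerSecond;
        }

        /** True if a link found at the given depth may still be visited. */
        boolean withinDepth(int depth) {
            return depth <= maxDepth;
        }
    }

    private final Map<String, DomainConfig> byDomain = new HashMap<>();
    private final DomainConfig fallback;

    CrawlConfigSketch(DomainConfig fallback) {
        this.fallback = fallback;
    }

    /** Registers settings for one domain (host names are compared case-insensitively). */
    void register(String domain, DomainConfig config) {
        byDomain.put(domain.toLowerCase(Locale.ROOT), config);
    }

    /** Looks up the settings for a URL's host, falling back to the defaults. */
    DomainConfig configFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return fallback;
        }
        return byDomain.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
    }

    public static void main(String[] args) {
        CrawlConfigSketch config =
                new CrawlConfigSketch(new DomainConfig(true, 1, 3, 1_000_000));
        config.register("example.com", new DomainConfig(false, 4, 5, 512_000));

        DomainConfig c = config.configFor("https://example.com/index.html");
        System.out.println(c.minDelayMillis()); // 250
        System.out.println(c.withinDepth(6));   // false
    }
}
```

Keeping the limits in an immutable per-domain object means a fetcher only needs one lookup per URL; a real crawler would presumably load these values from a configuration file rather than hard-coding them.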

NOTE: This is still an ongoing project and is not ready to use yet.

About

This is an open source web crawler example based on Java technologies.

Languages: Java 89.9%, Shell 5.7%, Batchfile 4.4%