vishalzanzrukia / java-web-crawler

An open source web crawler example based on Java technologies.

java-web-crawler

This is an open source web crawler example based on Java technologies, with the following features.

  • Automatic restart after each crawl cycle finishes
  • Configurable time between two cycles
  • Ability to resume crawling from the same state after a JVM or server crash or shutdown, continuing from where the crawler left off
  • Configuration to run crawler processes for different domains
  • Per-domain configuration of URL filters
  • Per-domain configuration of parsers
  • Configuration to enable or disable robots.txt rules
  • Configurable maximum number of URL visits per second
  • Configurable maximum depth to visit
  • Configurable maximum number of bytes to download per page
  • Sitemap parsing support
  • Retry support with parsing
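
To make the per-domain settings above more concrete, here is a minimal sketch of how URL filters, a maximum depth, and a visits-per-second limit might be combined for one domain. It is not taken from this project's code: the class name `CrawlPolicy` and the methods `shouldVisit` and `reserveSlot` are hypothetical, and it uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical per-domain crawl policy (illustrative only, not the project's API).
 * Combines URL filters, a maximum depth, and a visits-per-second limit.
 */
final class CrawlPolicy {
    private final List<Pattern> urlFilters = new ArrayList<>();
    private final int maxDepth;
    private final long minIntervalNanos;
    // Earliest time (in nanoseconds) at which the next visit may start.
    private long nextAllowedNanos = Long.MIN_VALUE;

    CrawlPolicy(List<String> filterRegexes, int maxDepth, int maxVisitsPerSecond) {
        if (maxVisitsPerSecond <= 0) {
            throw new IllegalArgumentException("maxVisitsPerSecond must be positive");
        }
        for (String regex : filterRegexes) {
            urlFilters.add(Pattern.compile(regex));
        }
        this.maxDepth = maxDepth;
        this.minIntervalNanos = 1_000_000_000L / maxVisitsPerSecond;
    }

    /** A URL is visited only if it is within the depth limit and fully matches a filter. */
    boolean shouldVisit(String url, int depth) {
        if (depth > maxDepth) {
            return false;
        }
        for (Pattern filter : urlFilters) {
            if (filter.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    /**
     * Reserves the next visit slot and returns how many nanoseconds the caller
     * should wait before fetching, given the current time in nanoseconds.
     */
    synchronized long reserveSlot(long nowNanos) {
        long start = Math.max(nowNanos, nextAllowedNanos);
        nextAllowedNanos = start + minIntervalNanos;
        return start - nowNanos;
    }
}
```

A crawler could keep one such object per domain, for example in a `Map<String, CrawlPolicy>` keyed by host name, call `shouldVisit(url, depth)` before queuing a link, and sleep for the value returned by `reserveSlot(System.nanoTime())` before each fetch. Passing the current time in as a parameter keeps the rate limit easy to test without real waiting.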

Technology Stack

  • Spring Boot
  • Spring Integration
  • Redis
  • Jsoup
  • ActiveMQ
  • ElasticSearch

NOTE: This is still an ongoing project and is not ready for use yet.

Languages

  • Java 89.9%
  • Shell 5.7%
  • Batchfile 4.4%