big-datai / scraper

Distributed web scraper built on Kafka, Spark, and HtmlUnit

Distributed web scraper using HtmlUtils

The goal of this project is to scrape the web in a simple yet powerful manner. You can install the project on multiple machines; each instance reads messages from a Kafka topic, enriches them with HTML content, and pushes them to another topic. This project has been tested on 50,000,000 messages in a few hours, producing a stream of 10 TB of data an hour.
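The read, enrich, and push cycle described above can be illustrated with a minimal in-memory sketch. It is not the project's actual code: `BlockingQueue` instances stand in for the Kafka input and output topics, the `fetchHtml` function stands in for the HTML fetching step, and the names `EnrichLoopSketch`, `Enriched`, and `enrichAll` are hypothetical. Forwarding failed fetches with empty content is also an assumption made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: queues stand in for Kafka topics, fetchHtml for the page fetcher.
public class EnrichLoopSketch {

    /** One enriched message: the original URL plus the HTML fetched for it. */
    public static final class Enriched {
        public final String url;
        public final String html;

        public Enriched(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }

    /**
     * Reads URLs from the input queue until it is empty, fetches HTML for each,
     * and pushes the enriched message to the output queue.
     * Returns the number of messages processed.
     */
    public static int enrichAll(BlockingQueue<String> input,
                                BlockingQueue<Enriched> output,
                                Function<String, String> fetchHtml) {
        int processed = 0;
        String url;
        // poll() returns null on an empty queue, which ends this sketch's loop;
        // a long-running worker would keep polling instead of stopping.
        while ((url = input.poll()) != null) {
            String html;
            try {
                html = fetchHtml.apply(url);
            } catch (RuntimeException e) {
                // Assumption: a failed fetch is forwarded with empty content.
                html = "";
            }
            output.add(new Enriched(url, html));
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<Enriched> output = new LinkedBlockingQueue<>();
        input.add("http://example.com/a");
        input.add("http://example.com/b");

        // Fake fetcher so the sketch runs without network access.
        int n = enrichAll(input, output, u -> "<html>" + u + "</html>");
        System.out.println("processed " + n + " messages");
    }
}
```

Because every worker only takes messages from a shared input and writes to a shared output, running more instances of the same loop is what lets the work spread across machines; in a Kafka deployment that sharing would typically come from consumers joining the same consumer group.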

License: Other


Languages

Java 63.3%, Scala 35.9%, Shell 0.8%