Web Crawler

Crawls websites recursively. High performance, with a seed database and storage into a search index. Written in Rust.

How it works

Web Crawler is a proof of concept written in Rust. It reads a website from the command line or from a seed database and starts to crawl all reachable sites, adds unknown external sites to the database, and crawls those as well. All found sites are added to an Elasticsearch index.
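
The flow can be pictured roughly as in the minimal sketch below. This is not the project's actual code: fetch_page, extract_links, index_page, and add_external_seed are hypothetical placeholders standing in for the HTTP fetch, link extraction, Elasticsearch indexing, and seed-table insert described above.

use std::collections::{HashSet, VecDeque};

// Placeholder for the HTTP fetch step.
fn fetch_page(url: &str) -> Option<String> {
    println!("fetching {url}");
    Some(String::new())
}

// Placeholder for parsing <a href="..."> links out of the HTML.
fn extract_links(_html: &str) -> Vec<String> {
    Vec::new()
}

// Placeholder for sending the document to the Elasticsearch index.
fn index_page(url: &str, _html: &str) {
    println!("indexing {url}");
}

// Placeholder for adding an unknown external site to the seed database.
fn add_external_seed(url: &str) {
    println!("new external seed: {url}");
}

fn crawl(seed: &str) {
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    queue.push_back(seed.to_string());

    while let Some(url) = queue.pop_front() {
        if !visited.insert(url.clone()) {
            continue; // already crawled
        }
        let Some(html) = fetch_page(&url) else { continue };
        index_page(&url, &html);

        for link in extract_links(&html) {
            if link.starts_with(seed) {
                // Internal link: crawl it in this run.
                queue.push_back(link);
            } else {
                // External link: store it in the seed database for later runs.
                add_external_seed(&link);
            }
        }
    }
}

fn main() {
    crawl("https://my_seed_site.com");
}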

The indexed sites can be accessed through the web crawler's companion project, the Search UI.

How to setup

This setup guide is only valid for macOS. For those who like Windows: good luck.

  1. Install Rust. The site has good documentation.
  2. Install PostgresApp, the easiest way to get Postgres up and running on a Mac.
  3. Install and run Postgres Admin.
  4. Install Elasticsearch and Kibana (optional).
  5. Start all of these (run PostgresApp and Elasticsearch).

Now you have the Rust toolchain and all required databases installed. You still need a Postgres database instance and a seed table. Open the Postgres Admin UI and create a database named "webcrawler_dev". Copy the contents of create_seed_table.sql into a SQL console in the admin tool and execute it.

How to run

You can now run the crawler with:

$ cargo run https://my_seed_site.com

A good first URL might be https://www.t-online.de/ or any site that matches your focus. Alternatively, run without an argument:

$ cargo run

In the first case, only the given site is crawled. In the second case, the seed database is queried and every entry is crawled. Any linked external sites are added to the seed database, so the database grows over time.
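
For illustration, reading the seeds from Postgres could look roughly like the sketch below. The table and column names ("seeds", "url") and the use of the postgres crate are assumptions made for this example; the actual schema is defined by create_seed_table.sql.

use postgres::{Client, Error, NoTls};

// Sketch only: load all seed URLs from the webcrawler_dev database.
// Table name "seeds" and column "url" are assumed for illustration.
fn load_seeds() -> Result<Vec<String>, Error> {
    let mut client = Client::connect(
        "host=localhost user=postgres dbname=webcrawler_dev",
        NoTls,
    )?;
    let rows = client.query("SELECT url FROM seeds", &[])?;
    Ok(rows.iter().map(|row| row.get::<_, String>("url")).collect())
}

fn main() -> Result<(), Error> {
    for url in load_seeds()? {
        println!("seed: {url}");
    }
    Ok(())
}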

If you have the Search UI installed from GitHub, you can now open a search console at http://localhost:3000 and check your results.

TODO / How to help

  • Stabilize. After a few thousand crawled sites it hangs.
  • Add real logging. Don't let the log grow indefinitely.
  • Improve reading of the seed table. Update entries with the last crawled date and add a 'max_depth' factor.
  • Add an ignore table. Some external links, such as ad sites, porn, and other illegal content, should never be crawled. Ignore them early (see the sketch after this list).
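
As a sketch of that last item, an early ignore check could look like the following. The ignore list here is just an in-memory set of host names made up for the example; in the crawler it would come from the proposed ignore table.

use std::collections::HashSet;

// Very rough host extraction, good enough for a sketch.
fn host_of(url: &str) -> Option<&str> {
    url.split("://").nth(1)?.split('/').next()
}

// Skip URLs whose host appears in the ignore set, before fetching anything.
fn should_crawl(url: &str, ignored_hosts: &HashSet<&str>) -> bool {
    match host_of(url) {
        Some(host) => !ignored_hosts.contains(host),
        None => false, // not a parsable URL, skip it
    }
}

fn main() {
    let ignored: HashSet<&str> = HashSet::from(["ads.example.com"]);
    for url in ["https://ads.example.com/banner", "https://example.org/page"] {
        println!("{url}: crawl = {}", should_crawl(url, &ignored));
    }
}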

License

MIT

Free to use. Be polite and reference its original creator.

Written by Thorsten Claus, Dortmund, Germany
