Postiii / twds-crawler

Highly scalable webcrawler for towardsdatascience.com by using Python, Selenium, Docker, Kubernetes and the infrastructure of the Google Cloud Platform

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

twds-crawler

This repository contains the code to build a highly scalable webcrawler for towardsdatascience.com by using Python, Selenium, Docker, Kubernetes and the infrastructure of the Google Cloud Platform. It was part of a datascience-class to get in touch with some of the most common technologies when it comes to big web- and big data processing.

Documentation

A more detailed description of the implementation can be found in my medium.com article.

Trouble Shooting

Additionally I documented some of my challenges in the trouble-shooting.md

About

Highly scalable webcrawler for towardsdatascience.com by using Python, Selenium, Docker, Kubernetes and the infrastructure of the Google Cloud Platform


Languages

Language:Python 96.0%Language:Shell 3.0%Language:Dockerfile 0.9%