jiaqi-yin / docker-crawler

Web Crawler

This project builds a web crawler that visits every page of a website and reports the broken ones.

User Story

As a developer

I want a tool that automatically checks every webpage on the website

So that I can quickly tell whether new features or bug fixes have broken any existing pages.

Acceptance Criteria

  • All public-facing webpages on the website can be easily located and tested.
  • Any error pages are logged for follow-up.
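
The two criteria above can be illustrated with a minimal sketch. It is not the project's spider: it uses only the Python standard library, and the names LinkCollector, internal_links, and is_broken are hypothetical helpers introduced here for illustration. It assumes that "locating" pages means following links that stay on the same host, and that an "error page" is any response with a 4xx or 5xx status.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop the fragment so /a and /a#x match.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def is_broken(status):
    """Assumption: any 4xx or 5xx HTTP status counts as a broken page."""
    return status >= 400
```

A crawler built on these helpers would fetch a page, record its URL if is_broken returns True for the response status, and queue every URL returned by internal_links that it has not visited yet.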

Getting Started

Add URLs for crawling

In the spider class (e.g., ./mycrawler/spiders/pageavailability.py), replace the example.com URL with the URL of the website you want to crawl.

Install and Run

This project has been tested on macOS only.

  1. Install Docker for Mac
  2. Clone this project to your local environment.
  3. Run docker-compose up from the top-level directory of the project.

The docker-compose up command starts the crawler service and runs the crawler against the specified website.

Common Practices

Avoiding getting banned for scraping
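
As a rough illustration of that practice, the fragment below shows throttling-related settings. It is a sketch under assumptions: it presumes the crawler is a standard Scrapy project whose settings live in a module such as ./mycrawler/settings.py (a path this README does not confirm), and the values and the contact URL are illustrative, not the project's actual configuration. The setting names themselves are standard Scrapy settings.

```python
# Hypothetical excerpt for a Scrapy settings module; values are illustrative.

# Identify the crawler honestly so site owners can reach you (assumed URL).
USER_AGENT = "mycrawler (+https://example.com/contact)"

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site, in seconds.
DOWNLOAD_DELAY = 2

# Keep per-domain concurrency low so the site is not overloaded.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the server's observed response times.
AUTOTHROTTLE_ENABLED = True

# Some sites use cookies to spot crawler behaviour; disable them if unneeded.
COOKIES_ENABLED = False
```

Slower, more polite settings lengthen a full-site check, so the delay and concurrency values are a trade-off to tune per website.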
