gbti-labs / py-domain-crawler-and-comparison-tool

This repository contains Python scripts to crawl and compare a website for changes.

Domain Crawler and Comparison Tool

This repository contains two Python scripts: one crawls every page of a website and records what it finds, and the other compares two crawls to show what changed.

Overview

  1. crawl_website.py - A web crawler that visits every page of a given domain and exports each page's URL, status code, page size, and page height to both .txt and .html files.
  2. compare.py - A script that takes two .txt files generated by crawl_website.py, representing the old and new versions of a website, and compares them side by side. It exports the result to an HTML file with the differences highlighted.
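To illustrate the crawling step, here is a minimal sketch of a breadth-first, same-domain crawl. It is not the code from crawl_website.py: the names `LinkParser`, `crawl`, `fetch`, and `max_pages` are hypothetical, and `fetch` is assumed to be a callable you supply that returns a status code and the page HTML. Page height is left out because measuring it would likely require rendering the page in a browser.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical callable: url -> (status_code, html_text).
    Returns a dict mapping each visited URL to (status_code, size_in_bytes).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        status, html = fetch(url)
        results[url] = (status, len(html.encode("utf-8")))
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            # Only follow links that stay on the starting host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap whichever client the project's requirements.txt provides.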

Installation

Installing Python

  • Windows: Download the installer from Python's official site and follow the installation steps. Make sure to check the "Add Python to PATH" checkbox during installation.

  • macOS: Some macOS versions ship with an outdated Python or none at all, so it is safest to download the latest version from Python's official site.

  • Linux: Use your distribution's package manager to install Python. For example, on Ubuntu:

sudo apt-get update
sudo apt-get install python3 python3-pip

Installing Requirements

After installing Python, you need to install the required packages. Navigate to the project folder in your terminal and run:

pip install -r requirements.txt

Usage

  1. Crawling a Website:
python crawl_website.py

Follow the prompts to enter the website domain and select the type of crawl.

  2. Comparing Websites:
python compare.py

Follow the prompts to select the .txt files to be compared.
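As a rough illustration of what a side-by-side comparison with highlighted differences can look like, the sketch below uses `difflib.HtmlDiff` from the Python standard library. This is an assumption about the general technique, not the implementation in compare.py; the helper names `read_lines` and `side_by_side_html` are hypothetical, and the exact layout of the .txt exports is not documented here, so the lines are compared as plain text.

```python
import difflib
from pathlib import Path


def read_lines(path):
    """Read a crawl export and return its lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def side_by_side_html(old_lines, new_lines):
    """Return a complete HTML page showing both line lists side by side.

    Added, removed, and changed lines are highlighted by difflib.
    """
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_lines, new_lines, fromdesc="old crawl", todesc="new crawl"
    )
```

To try it, read two exports with `read_lines`, pass them to `side_by_side_html`, and write the returned string to an .html file that you open in a browser.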

Why Compare Websites?

  • Migrating to a New Platform/Host: Before switching to a new platform or hosting service, you may want to ensure that all URLs from the old platform exist in the new platform and function as expected.

  • Switching WordPress Themes: A change in theme may result in differences in content display, load times, or even broken links. Comparing the website before and after the switch can highlight these issues.

  • SEO Analysis: Ensuring that URLs, especially high-traffic ones, remain consistent during any changes can help preserve SEO rankings.

  • Quality Assurance: Before rolling out a redesigned website, comparing the old and new sites can help identify bugs, missing content, or other issues that need to be addressed.
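For the migration case above, the core check is a set difference: which paths existed on the old site but are missing on the new one, and which paths now return a different status code. The sketch below assumes each crawl has already been loaded into a dict mapping URL to status code; that shape and the names `path_of` and `migration_report` are hypothetical. URLs are reduced to their paths so that a change of host name does not count as a difference, and query strings are ignored.

```python
from urllib.parse import urlparse


def path_of(url):
    """Reduce a URL to its path so crawls of two hosts can be compared."""
    return urlparse(url).path or "/"


def migration_report(old, new):
    """Compare two crawls given as dicts of URL -> HTTP status code.

    Returns (missing, status_changed):
      missing        - sorted paths present in old but absent from new
      status_changed - sorted (path, old_status, new_status) tuples
    """
    old_by_path = {path_of(url): status for url, status in old.items()}
    new_by_path = {path_of(url): status for url, status in new.items()}
    missing = sorted(set(old_by_path) - set(new_by_path))
    status_changed = sorted(
        (path, old_by_path[path], new_by_path[path])
        for path in set(old_by_path) & set(new_by_path)
        if old_by_path[path] != new_by_path[path]
    )
    return missing, status_changed
```

An empty `missing` list and an empty `status_changed` list would suggest that every old URL still resolves the same way on the new platform.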

