
Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawls web pages and extracts links, starting from a root URL
  • Supports concurrent workers and a configurable delay between requests
  • Handles both relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
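Handling relative and absolute URLs typically comes down to resolving each discovered `href` against the page it was found on. This is a minimal sketch using the standard library's `urllib.parse` to illustrate the idea; `resolve_link` and `is_http_url` are hypothetical helper names, not part of this library's API:

```python
from urllib.parse import urljoin, urlparse

def resolve_link(base_url: str, href: str) -> str:
    """Resolve a possibly-relative href against the page it was found on."""
    return urljoin(base_url, href)

def is_http_url(url: str) -> bool:
    """Keep only http(s) links; skip schemes like mailto: or javascript:."""
    return urlparse(url).scheme in ("http", "https")

# A relative path resolves against the base page:
print(resolve_link("https://github.com/about", "/pricing"))
# -> https://github.com/pricing

# An already-absolute URL passes through unchanged:
print(resolve_link("https://github.com/about", "https://docs.github.com/"))
# -> https://docs.github.com/
```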

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler.crawler import Spider

root_url = 'http://github.com'
max_links = 2

crawl = Spider(root_url, max_links)
crawl.start()


# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

crawl = Spider(root_url='https://github.com', max_links=5, max_workers=5, delay=1, verbose=False)
crawl.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
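Because the output maps each crawled page to the list of links found on it, it is easy to post-process with the standard `json` module. A minimal sketch, assuming a results dict in the shape of the sample above (the URLs here are illustrative):

```python
import json

# Results in the shape shown above, assumed to be in memory or loaded
# from a saved JSON file with json.load().
results = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    },
}

# Count how often each link appears across all crawled pages.
link_counts = {}
for page, data in results.items():
    for url in data["urls"]:
        link_counts[url] = link_counts.get(url, 0) + 1

print(json.dumps(link_counts, indent=2))
```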

License

MIT License

