
Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawls web pages and extracts links, starting from a root URL
  • Supports concurrent workers and a configurable delay between requests
  • Handles both relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
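Handling relative and absolute URLs typically comes down to resolving each discovered `href` against the page it was found on. This is a minimal sketch using the standard library's `urllib.parse` to illustrate the idea; `resolve_link` and `is_http_url` are hypothetical helper names, not part of this library's API:

```python
from urllib.parse import urljoin, urlparse

def resolve_link(base_url: str, href: str) -> str:
    """Resolve a possibly-relative href against the page it was found on."""
    return urljoin(base_url, href)

def is_http_url(url: str) -> bool:
    """Keep only http(s) links; skip schemes like mailto: or javascript:."""
    return urlparse(url).scheme in ("http", "https")

# A relative path resolves against the base page:
print(resolve_link("https://github.com/about", "/pricing"))
# -> https://github.com/pricing

# An already-absolute URL passes through unchanged:
print(resolve_link("https://github.com/about", "https://docs.github.com/"))
# -> https://docs.github.com/
```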

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler.crawler import Spider

root_url = 'http://github.com'
max_links = 2

crawl = Spider(root_url, max_links)
crawl.start()


# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

crawl = Spider(root_url='https://github.com', max_links=5, max_workers=5, delay=1, verbose=False)
crawl.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
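Because the output maps each crawled page to the list of links found on it, it is easy to post-process with the standard `json` module. A minimal sketch, assuming a results dict in the shape of the sample above (the URLs here are illustrative):

```python
import json

# Results in the shape shown above, assumed to be in memory or loaded
# from a saved JSON file with json.load().
results = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    },
}

# Count how often each link appears across all crawled pages.
link_counts = {}
for page, data in results.items():
    for url in data["urls"]:
        link_counts[url] = link_counts.get(url, 0) + 1

print(json.dumps(link_counts, indent=2))
```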

License

MIT License

