capjamesg / getsitemap

A Python library that retrieves all URLs in the sitemaps on a website.

Home Page: https://getsitemap.readthedocs.io/en/latest/

getsitemap

getsitemap is a Python library that retrieves every URL listed in the sitemaps on a website.

This project may be useful if you are building a search crawler or a validator that checks the status codes of the URLs in a sitemap.

You can read the documentation for this project on Read the Docs.

Installation 💻

To get started, pip install `getsitemap`:

pip install getsitemap

Quickstart ⚡

Get all URLs recursively in all sitemaps

import getsitemap

urls = getsitemap.get_individual_sitemap("https://jamesg.blog/sitemap.xml")

print(urls)
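If you are feeding a crawler, you may only want part of a site. The sketch below narrows a list of URLs to those under a given path. It assumes `urls` is a flat list of URL strings, which this README does not state explicitly. `filter_by_path_prefix` is a hypothetical helper written for this example and is not part of getsitemap.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def filter_by_path_prefix(urls: Iterable[str], prefix: str) -> List[str]:
    """Keep only the URLs whose path component starts with prefix.

    Hypothetical helper for illustration; not part of getsitemap.
    """
    return [url for url in urls if urlparse(url).path.startswith(prefix)]
```

For example, `filter_by_path_prefix(urls, "/2023/")` would keep only the entries whose path begins with `/2023/`.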

Get all URLs in a single sitemap

import getsitemap

all_urls = getsitemap.retrieve_sitemap_urls("https://sitemap")

print(all_urls)
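As a rough sketch of the status code validator use case mentioned earlier, the code below groups URLs by the HTTP status code they return. It uses only the Python standard library and assumes the URLs arrive as a flat list of strings. `fetch_status` and `group_by_status` are hypothetical names written for this example and are not part of getsitemap.

```python
import urllib.error
import urllib.request
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the HTTP status code.

    Returns 0 when the request fails before any HTTP response arrives
    (for example a DNS failure or a timeout).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; the code is still useful.
        return error.code
    except (urllib.error.URLError, OSError):
        return 0


def group_by_status(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[int, List[str]]:
    """Map each status code to the URLs that returned it, in input order."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for url in urls:
        grouped[fetch(url)].append(url)
    return dict(grouped)
```

Passing the fetch function as a parameter keeps the grouping logic testable without network access. Calling `group_by_status(all_urls)` uses the real HEAD requests. Note that `urlopen` follows redirects, so a redirected URL is reported with the status of its final destination.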

Code Quality

This library uses tox, pytest, and flake8 to assure code quality.

To run code quality checks, run the following command:

tox

License 👩‍⚖️

This project is licensed under the MIT License.

Contributing 🛠️

We would love to have your help in improving getsitemap. Have an idea for a new feature or a bug to fix? Leave information in a GitHub Issue to start a discussion!

If you have

Contributors 💻

  • capjamesg
