seantomburke / sitemapper

parses sitemaps for Node.js

Home Page: https://www.npmjs.com/package/sitemapper


Support robots.txt Sitemaps (plural!) discovery

Abdull opened this issue

The robots.txt standard allows declaring the locations of sitemaps (plural!), e.g. https://www.nytimes.com/robots.txt:

# ....
User-Agent: omgili
Disallow: /

User-agent: ia_archiver
Disallow: /

Sitemap: https://www.nytimes.com/sitemaps/new/news.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/collections.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/video.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/cooking.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/recipe-collects.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/new/regions.xml
Sitemap: https://www.nytimes.com/sitemaps/new/best-sellers.xml
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap
Sitemap: https://www.nytimes.com/wirecutter/sitemapindex.xml

It would be great if sitemapper could accept a robots.txt URL and transitively return all the Sitemap URLs it declares.
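
A minimal sketch of what this could look like, layered on top of sitemapper's existing fetch() API. Note this is not part of sitemapper today: discoverFromRobots is a hypothetical helper name, and the snippet assumes Node 18+ for the built-in fetch.

// Hypothetical helper: fetch a robots.txt, collect its Sitemap: lines,
// and run each declared sitemap through Sitemapper#fetch().
import Sitemapper from 'sitemapper';

async function discoverFromRobots(robotsUrl) {
  const res = await fetch(robotsUrl);
  const body = await res.text();

  // robots.txt allows any number of "Sitemap:" lines, and they are
  // independent of the User-agent groups, so a simple line scan suffices.
  const sitemapUrls = body
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice('sitemap:'.length).trim());

  // Fetch every declared sitemap and merge the page URLs they contain.
  const sitemapper = new Sitemapper({ timeout: 15000 });
  const results = await Promise.all(
    sitemapUrls.map((url) => sitemapper.fetch(url))
  );
  return results.flatMap((result) => result.sites);
}

discoverFromRobots('https://www.nytimes.com/robots.txt')
  .then((sites) => console.log(`discovered ${sites.length} URLs`))
  .catch(console.error);

Since several of the NYT sitemaps above are sitemap indexes, sitemapper's existing recursive handling of sitemap indexes would take care of the nesting; the only new piece is the robots.txt line scan.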