lgraubner / sitemap-generator

Easily create XML sitemaps for your website.

Do not allow AMP version

Cool-Programmer opened this issue · comments

Greetings.
The website I'm currently working on has an AMP version.
Basically, most of the URLs have a second version with stripped-down HTML, and they live at https://www.example.com/music/amp (...)
The documentation says there is no need to include AMP pages in the sitemap, but I also can't disallow them in robots.txt.

What can I do in this situation?

Hi! Those pages should not be added then, of course. I can add a filter based on either the URL or the content of the pages. The first option would be better as it's faster.

Is there any standard for what an AMP URL should look like?

Hi again.
I've actually implemented it myself by editing the /src/discoverResources.js file and adding:
// exclude AMP
if (/amp/i.test(href)) { return null; }

Unfortunately, there is no actual standard for how an AMP URL should be structured (e.g. amp.domain.com or domain.com/amp).

The issue for me is currently solved :)
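
For readers adapting the URL-based check above: the pattern /amp/i matches any URL that contains the letters "amp" anywhere, including the "amp" in "example.com". A minimal sketch of a stricter test follows. It assumes absolute URLs and only the two layouts mentioned in this thread (an amp. subdomain or an amp path segment). The helper name isAmpUrl is hypothetical and not part of sitemap-generator.

```javascript
// Hypothetical helper (not part of sitemap-generator): guess from the URL
// alone whether a link points at an AMP page.
function isAmpUrl(href) {
  let url;
  try {
    url = new URL(href);
  } catch (err) {
    // Assumption: relative or malformed hrefs are treated as non-AMP here.
    return false;
  }
  // Subdomain form, e.g. amp.domain.com
  const ampHost = /^amp\./i.test(url.hostname);
  // Path-segment form, e.g. domain.com/music/amp or domain.com/amp/music
  const ampPath = url.pathname
    .split('/')
    .some(segment => segment.toLowerCase() === 'amp');
  return ampHost || ampPath;
}
```

Matching whole path segments rather than substrings keeps pages such as /camping or /samples in the sitemap. Sites that mark AMP pages some other way (a query parameter, for instance) would need a different rule.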

@Cool-Programmer Indeed, there is no standard URL format. That said, the proper ways to detect whether a page uses AMP are (you have to visit the page first):

  1. Detect if its <html> tag has an AMP attribute (such as <html ⚡> or <html amp>, see the markup at https://www.ampproject.org/docs/getting_started/create/basic_markup).
  2. Detect if the page has a canonical link to a non-AMP version (https://www.ampproject.org/docs/fundamentals/discovery).
    Dunno if these properties are ever useful in your case... (I've been working with the AMP team recently, so I can also ask them about alternative ways.)
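
The first check in the list above could be sketched roughly as follows. This is an approximation that uses a regular expression rather than a real HTML parser, and it assumes the page markup is already available as a string. The function name isAmpHtml is hypothetical.

```javascript
// Hypothetical helper: true if the document's <html> tag carries the
// "amp" or "⚡" attribute. Regex-based and approximate, not a full parser.
function isAmpHtml(html) {
  const match = /<html\b([^>]*)>/i.exec(html);
  if (!match) return false;
  // Split the attribute string on whitespace and compare attribute names only,
  // so a value such as lang="amp" does not count.
  return match[1]
    .split(/\s+/)
    .map(attr => attr.split('=')[0].toLowerCase())
    .some(name => name === 'amp' || name === '⚡');
}
```

Comparing attribute names exactly, instead of searching the tag for "amp", avoids false positives from attributes like data-amp or from attribute values.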

@kevinkassimo Thank you for your answer! In this specific case, where I can be sure that the /amp part of the URL will never change, the solution is quite simple. As far as I know, if the website has two versions of a page (one AMP and one regular), they can be differentiated by URL, but if it has only an AMP version, it gets harder to implement.

Actually, we have the content of the page available and can ignore such cases. Reading @kevinkassimo's suggestions, I'm thinking about adding both. I think pages with canonical links shouldn't be added even if they're not AMP.
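
One possible reading of the canonical-link idea, sketched under assumptions: skip a page only when its canonical link points to a different URL, since many pages declare themselves as canonical. The helper names below are hypothetical, the parsing is regex-based and approximate, and this is not necessarily how the feature branch implements it.

```javascript
// Hypothetical helper: return the canonical URL declared in the page, or null.
function getCanonicalHref(html) {
  const tags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of tags) {
    if (/\brel\s*=\s*["']?canonical["']?/i.test(tag)) {
      const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
      if (href) return href[1];
    }
  }
  return null;
}

// True if the page declares a canonical URL that differs from its own URL.
// Assumption: pageUrl is an absolute URL.
function hasForeignCanonical(html, pageUrl) {
  const canonical = getCanonicalHref(html);
  if (!canonical) return false;
  // Resolve relative canonicals against the page URL before comparing.
  return new URL(canonical, pageUrl).href !== new URL(pageUrl).href;
}
```

For example, a page at https://www.example.com/music/amp whose canonical link points to https://www.example.com/music would be skipped, while the /music page itself would be kept.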

BTW @Cool-Programmer: For now you can ignore AMP pages without adjusting the source code.

const crawler = generator.getCrawler();
crawler.addFetchCondition((queueItem, referrerQueueItem, callback) => {
  callback(!queueItem.path.match(/\/amp\//));
});

Hopefully I can add the AMP changes today.

I created a feature branch (feature/ignore-amp) with the discussed functionality. @Cool-Programmer would you mind testing if it works with your site?

@lgraubner Worked perfectly, great!

Published a new version containing this feature.