Do not allow AMP version

Question

Cool-Programmer opened this issue 6 years ago · comments

Greetings.
The website I'm currently working on has an AMP version.
Basically, most of the URLs have a second version with stripped-down HTML, and they live at addresses like https://www.example.com/music/amp (...)
The documentation says there is no need to include AMP pages in the sitemap, but I also can't disallow them in robots.txt.

What can I do in this situation?

Lars Graubner · Answer 1 · Fri Jul 06 2018 16:53:29 GMT+0800 (China Standard Time)

Hi! Then those pages should of course not be added. I can add a filter based on either the URL or the content of the pages. The first option would be better, as it's faster.

Is there any standard for what an AMP URL should look like?

Marshall Miziani · Answer 2 · Sun Jul 08 2018 21:38:19 GMT+0800 (China Standard Time)

Hi again.
I've actually implemented it myself by editing the /src/discoverResources.js file, adding:
// exclude AMP
if (/amp/i.test(href)) { return null; }

Unfortunately, there is no actual standard for how an AMP URL should be structured (e.g. amp.domain.com or domain.com/amp).
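
For readers adapting this: a bare /amp/i test also matches unrelated URLs such as /camping, and even the "amp" inside "example.com". Below is a minimal sketch of a stricter URL check, assuming AMP pages sit under an "amp" path segment or an "amp." subdomain. The function name isAmpUrl and the default base URL are hypothetical and not part of the library.

```javascript
// Hypothetical helper: guess from the URL alone whether a link is an AMP page.
// Assumption: AMP pages use an "amp" path segment or an "amp." subdomain.
function isAmpUrl(href, base = "https://www.example.com/") {
  let url;
  try {
    url = new URL(href, base); // resolves relative links against the base
  } catch (err) {
    return false; // unparsable links are not treated as AMP
  }
  // Compare whole path segments so "/camping" or "/examples" do not match.
  const hasAmpSegment = url.pathname
    .split("/")
    .some((segment) => segment.toLowerCase() === "amp");
  const hasAmpSubdomain = url.hostname.toLowerCase().startsWith("amp.");
  return hasAmpSegment || hasAmpSubdomain;
}
```

With this, isAmpUrl("/music/amp") and isAmpUrl("https://amp.example.com/music") are both treated as AMP, while isAmpUrl("/camping") is not.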

The issue for me is currently solved :)

Kevin (Kun) "Kassimo" Qian · Answer 3 · Mon Jul 09 2018 00:42:56 GMT+0800 (China Standard Time)

@Cool-Programmer Indeed, there is no standard URL format. That said, the proper ways to detect whether a webpage uses AMP are (you have to visit the page first):

Detect if its <html> tag has the AMP attribute (such as <html ⚡> or <html amp>; see the markup at https://www.ampproject.org/docs/getting_started/create/basic_markup).
Detect if the page has a canonical link to a non-AMP version (https://www.ampproject.org/docs/fundamentals/discovery).
I don't know if these properties are useful in your case... (I've actually been working with the AMP team recently, so I can also ask them about alternative ways.)
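
The two checks above could be sketched roughly like this, working on the raw HTML string of a page that has already been fetched. The function names hasAmpAttribute and getCanonicalHref are hypothetical, and the regular expressions are only an approximation; a real HTML parser would be more robust.

```javascript
// Hypothetical helpers operating on the raw HTML string of a fetched page.

// True if the opening <html> tag carries the "amp" or "⚡" attribute.
function hasAmpAttribute(html) {
  const match = html.match(/<html\b([^>]*)>/i);
  if (!match) return false;
  // The attribute must be a standalone name, not part of another name or value.
  return /(^|\s)(amp|⚡)(?=\s|=|$)/i.test(match[1]);
}

// Returns the href of the first <link rel="canonical">, or null if none exists.
function getCanonicalHref(html) {
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    if (/\brel\s*=\s*["']?canonical\b/i.test(tag)) {
      const href = tag.match(/\bhref\s*=\s*["']([^"']*)["']/i);
      return href ? href[1] : null;
    }
  }
  return null;
}
```

A page could then be treated as AMP when hasAmpAttribute(html) is true, with getCanonicalHref(html) giving the address of the regular version, if there is one.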

Marshall Miziani · Answer 4 · Mon Jul 09 2018 07:31:29 GMT+0800 (China Standard Time)

@kevinkassimo Thank you for your answer! In this specific case, where I can be sure that the /amp part of the URL will never change, the solution is quite simple. As far as I know, if the website has two versions (one AMP and one regular), they can be differentiated by URL, but if it has only an AMP version, it gets harder to implement.

Lars Graubner · Answer 5 · Mon Jul 09 2018 18:37:37 GMT+0800 (China Standard Time)

Actually, we have the content of the page available and can ignore such cases. Reading @kevinkassimo's suggestions, I'm thinking about adding both. I think pages with canonical links shouldn't be added even if they're not AMP.
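
One way to read the canonical rule, as a sketch only: many pages carry a canonical link that points at themselves, so this assumes a page is skipped only when its canonical link points at a different URL. The function name canonicalPointsElsewhere is hypothetical, and ignoring trailing slashes and fragments is an assumption too, not necessarily what the feature branch does.

```javascript
// Hypothetical helper: should a page be left out of the sitemap because its
// canonical link points at a different URL? Assumes the canonical href was
// already extracted from the page (null when there is none).
function canonicalPointsElsewhere(pageUrl, canonicalHref) {
  if (!canonicalHref) return false; // no canonical link: keep the page
  const normalize = (u) => {
    const url = new URL(u, pageUrl); // the canonical href may be relative
    url.hash = ""; // fragments do not identify a different page
    // Assumption: "/music" and "/music/" are the same page.
    url.pathname = url.pathname.replace(/\/+$/, "") || "/";
    return url.href;
  };
  return normalize(pageUrl) !== normalize(canonicalHref);
}
```

For example, a page at https://www.example.com/music/amp whose canonical link is https://www.example.com/music would be skipped, while the /music page itself would be kept.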

BTW @Cool-Programmer: For now you can ignore AMP pages without adjusting the source code.

const crawler = generator.getCrawler();
crawler.addFetchCondition((queueItem, referrerQueueItem, callback) => {
  callback(!queueItem.path.match(/\/amp\//));
});

Hopefully I can add the AMP changes today.

Lars Graubner · Answer 6 · Tue Jul 10 2018 03:35:29 GMT+0800 (China Standard Time)

I created a feature branch (feature/ignore-amp) with the discussed functionality. @Cool-Programmer would you mind testing if it works with your site?

Marshall Miziani · Answer 7 · Tue Jul 10 2018 05:46:35 GMT+0800 (China Standard Time)

@lgraubner Worked perfectly, great!

Lars Graubner · Answer 8 · Wed Jul 11 2018 01:34:55 GMT+0800 (China Standard Time)

Published a new version containing this feature.