hexpm / hexdocs

Service that manages the static documentation on https://hexdocs.pm

Home Page: https://hexdocs.pm

404 pages are included in sitemap XML output

florish opened this issue

I came across a minor issue with the sitemap XML generated for hexdocs.pm packages. For every package, the 404.html error page is listed as a <url> entry. This is probably unintended, as it does not make sense to have search engines index this page.

Example for the Ecto sitemap:

<urlset ...>
  <url>
    <loc>https://hexdocs.pm/ecto/404.html</loc>
    <lastmod>2023-08-18T13:55:20Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>

This line of code seems to be close to the source of this issue, but I'm not sure how to fix this. Any guidance is much appreciated!

Additional finding: the same goes for the autogenerated search.html pages, which are also best excluded from search engine results, as they do not contain any useful content.
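To check a package's sitemap for these entries, something like the following Python sketch could be used. It assumes the standard sitemaps.org namespace (the `<urlset>` attributes are elided in the example above), and the function name, constant names, and sample data are hypothetical, not part of HexDocs.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Page names this issue suggests keeping out of search results.
NON_CONTENT_PAGES = ("404.html", "search.html")


def find_non_content_urls(sitemap_xml):
    """Return every <loc> value whose last path segment is a non-content page."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = (loc.text or "").strip()
        # Compare only the file name at the end of the URL path.
        if url.rsplit("/", 1)[-1] in NON_CONTENT_PAGES:
            flagged.append(url)
    return flagged


# Hypothetical sample, shaped like the Ecto excerpt above.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://hexdocs.pm/ecto/404.html</loc></url>
  <url><loc>https://hexdocs.pm/ecto/Ecto.html</loc></url>
  <url><loc>https://hexdocs.pm/ecto/search.html</loc></url>
</urlset>"""

for url in find_non_content_urls(SAMPLE):
    print(url)
```

This only reports the entries; whether and where to exclude them is the open question of this issue.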

HexDocs is dumb: it doesn't, and it should not, assume anything about ExDoc or whatever generates the content, so I am afraid there isn't much we can do here. :(

Thanks for explaining! Maybe a noindex meta tag in the generated 404.html and search.html pages is an option then. I'll have a look at whether this is possible.
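For reference, the tag in question is the standard robots meta tag, `<meta name="robots" content="noindex">`. A minimal Python sketch for checking whether a generated page carries it could look like this; the class name, function name, and sample page are hypothetical and say nothing about how ExDoc actually renders its templates.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical 404 page with the proposed tag in its <head>.
PAGE_404 = (
    '<html><head><meta name="robots" content="noindex">'
    "<title>404</title></head><body>Page not found</body></html>"
)

print(has_noindex(PAGE_404))
```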

@florish we could definitely do that. Can you please send a PR to github.com/elixir-lang/ex_doc?

@josevalim Sure, will do!