404 pages are included in sitemap XML output
florish opened this issue · comments
I accidentally came across a minor issue with the sitemap XML generated for hexdocs.pm packages. For all packages, the 404.html error page is indexed as a <url> entry. This is probably unintended, as it does not really make sense to have search engines index this page.
Example for the Ecto sitemap:
<urlset ...>
  <url>
    <loc>https://hexdocs.pm/ecto/404.html</loc>
    <lastmod>2023-08-18T13:55:20Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>
This line of code seems to be close to the source of this issue, but I'm not sure how to fix this. Any guidance is much appreciated!
Additional finding: the same goes for the autogenerated search.html pages, which are also best excluded from search engine results, as they do not contain any useful content.
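For illustration only, here is a minimal sketch of the kind of filtering described above, written in Python rather than in the Elixir that HexDocs actually uses. The function name drop_excluded_urls, the EXCLUDED_PAGES list, and the sample URLs other than the 404.html one are hypothetical; the only thing taken from a standard is the sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Assumption: the file names this issue suggests leaving out of the sitemap.
EXCLUDED_PAGES = ("404.html", "search.html")


def drop_excluded_urls(sitemap_xml: str) -> str:
    """Return the sitemap with <url> entries for excluded pages removed."""
    # Keep the default namespace unprefixed when serializing again.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="")
        # Compare only the last path segment, e.g. "404.html".
        if loc.strip().rsplit("/", 1)[-1] in EXCLUDED_PAGES:
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input modeled on the Ecto example above.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://hexdocs.pm/ecto/404.html</loc></url>
  <url><loc>https://hexdocs.pm/ecto/Ecto.html</loc></url>
  <url><loc>https://hexdocs.pm/ecto/search.html</loc></url>
</urlset>"""

filtered = drop_excluded_urls(sample)
print(filtered)
```

Note that a filter like this hard-codes knowledge of which files the documentation generator emits, which is exactly the kind of assumption a generic host may not want to make.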
HexDocs is dumb: it doesn't, and should not, assume anything about ExDoc or whatever generates the content, so I am afraid there isn't much we can do here. :(
Thanks for explaining! Maybe a noindex meta tag in the generated 404.html and search.html pages is an option then. I'll look into whether this is possible.
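As a rough sketch of what the noindex idea amounts to: the standard robots meta tag is placed in the page's head. The Python below only illustrates the effect on a generated page; it is not how ExDoc builds its templates, and the names NOINDEX_TAG and add_noindex as well as the sample page are hypothetical.

```python
import re

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'

# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_noindex(html: str) -> str:
    """Insert the noindex tag right after the opening <head> tag.

    The page is returned unchanged if the tag is already present
    or if no <head> tag is found.
    """
    if NOINDEX_TAG in html:
        return html
    return HEAD_OPEN.sub(lambda m: m.group(0) + "\n  " + NOINDEX_TAG, html, count=1)


# Hypothetical stand-in for a generated 404.html page.
page = "<html><head><title>404</title></head><body><header>Not found</header></body></html>"
result = add_noindex(page)
print(result)
```

Putting the tag in the generated page keeps the decision with the tool that knows what the page is, so the host does not need to treat any file name specially.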
@florish we could definitely do that. Can you please send a PR to github.com/elixir-lang/ex_doc?
@josevalim Sure, will do!