gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Reporting broken links when using relative internal links reported as broken

bnorberg opened this issue · comments

Hi, I am running into an issue where HTML Proofer is reporting working relative links as broken.

Example link

<a href="../admin/configuring-load-balancer-endpoints.html">

Example feedback from HTML Proofer

Screenshot 2023-02-25 at 11 05 53 AM

Any suggestions on what could be causing the broken link message?

My only solution to ensure I don't get false positives on relative links is to add '../' links in the ignore-urls option or changing all links to root relative. Neither solution is ideal.

Hm, I can't reproduce this issue. Are you sure configuring-load-balancer-endpoints.html exists? Could you zip up your output for me to examine it, or is there some other way I can reproduce it?

Thanks, for the quick response, gjtorikian.

That page does exist because the link works on the deploy site. Here's the page in the gh-pages branch of the deploy, https://github.com/NetAppDocs/storagegrid-116/blob/gh-pages/admin/configuring-load-balancer-endpoints.html and here is the page that is reference the above link through the relative https://github.com/NetAppDocs/storagegrid-116/blob/gh-pages/admin/configuring-custom-server-certificate-for-storage-node-or-clb.html. Again this is just one example. We are getting the broken link on all .. relative links. If we change <a href="../admin/configuring-load-balancer-endpoints.html"> to <a href="./admin/configuring-load-balancer-endpoints.html"> or <a href="/admin/configuring-load-balancer-endpoints.html">, the link checker will not return a broken link message.

We are using a mix of private and public GitHub build runners and different deployment environments for site builds so it will probably not be possible to replicate what I am seeing.

Is there a possibility that this could be related to a site prefix causing the .. in the relative link to not actually be in the correct directory to find the linked file?? We use a language prefix on our site.

Thanks again.

Sorry, but this is probably going to be a bit of back and forth. 😓

There are already a few tests that check to make sure relative (../) links works; I've just pushed a new test up that tries to mimic your issue and it, too, passes: bd89111

But as I said there are many tests for this. One thing I noticed is that the GitHub page you link to for configuring-load-balancer-endpoints.html has 0 links for ../admin/configuring-load-balancer-endpoints.html, but it does have links for /us-en/storagegrid-116/admin/configuring-load-balancer-endpoints.html. Is this expected?

@gjtorikian, are the tests also covering the case of running on a directory instead of an individual file?

By looking at

path_dot_ext = ""
path_dot_ext = path + @runner.options[:assume_extension] unless blank?(@runner.options[:assume_extension])
base = if absolute_path?(path) # path relative to root
# either overwrite with root_dir; or, if source is directory, use that; or, just get the current file's dirname
@runner.options[:root_dir] || (File.directory?(@runner.current_source) ? @runner.current_source : File.dirname(@runner.current_source))
# relative links, path is a file
elsif File.exist?(File.expand_path(
path,
@runner.current_source,
)) || File.exist?(File.expand_path(path_dot_ext, @runner.current_source))
File.dirname(@runner.current_filename)
# relative links in nested dir, path is a file
elsif File.exist?(File.join(
File.dirname(@runner.current_filename),
path,
)) || File.exist?(File.join(File.dirname(@runner.current_filename), path_dot_ext))
File.dirname(@runner.current_filename)
# relative link, path is a directory
else
@runner.current_filename
end
file = File.join(base, path)

It seems the logic could be different in the two cases since for a "file" check I think @runner.current_source is the same as @runner.current_filename.

Looking at the code, there is perhaps an issue with the following lines, where it would seem more reasonable to use File.dirname(@runner.current_filename) instead of @runner.current_source in File.expand_path :

elsif File.exist?(File.expand_path(
path,
@runner.current_source,
)) || File.exist?(File.expand_path(path_dot_ext, @runner.current_source))
File.dirname(@runner.current_filename)

No worries, @gjtorikian. I totally understand and really appreciate @riccardoporreca's and your help.

We are running the check on a directory. And we do add a prefix during the build process to the the root url of the page for the language code (ie, us-en) and the name of the repository. That is why you are seeing root url like /us-en/storagegrid-116 and not ../admin/configuring-load-balancer-endpoints.html.

I am so sorry. I probably confused you with a bad relative link example on the previous post. This would be a relative link example for the configuring-load-balancer-endpoints.html page that is not working , <a href="../admin/web-browser-requirements.html">.

Maybe this prefix is the problem. When a writer uses a relative link to another file in that repository (ie, ../admin/web-browser-requirements.html") the prefix might be throwing off where in the directory the check is actually running. And maybe @riccardoporreca suggestion can fix this?

Anyway, thanks again for your help.

@bnorberg, I did a quick test myself and the "directory" check seem to support correctly "../" links.

Any chance you can share even a small part of the actual website you run on and see the failures, which as I understand is not the public repo you are pointing to? On the public repo, html-proofer has no issue with ../admin/web-browser-requirements.html (I ran on a local checkout and there is no error on that link). Also important is sharing the options you call html-proofer with, as well as the version you are using. It is hard to investigate without a way of reproducing the error.

@riccardoporreca @gjtorikian thanks again for testing and confirming all's working for you.

We have Ruby 2.6 in our build environment and so are using HTML Proofer version 4.4.3. Here is a gist of our configuration, https://gist.github.com/bnorberg/f22875beb1c5fc4cdee7f7db91f3ee72. You'll see that right now we add ../ links to the ignore-urls options to avoid getting failure reports. We wrap HTML Proofer into a homegrown application and run it as a utility in a component step of a GitHub Actions workflow.

Unfortunately, I cannot share any content from the content repository in which we are seeing the failures because it is pre-public release. But the repo I have shared already, https://github.com/NetAppDocs/storagegrid-116/tree/gh-pages, is very similar content-wise. We build our public sites using GitHub free Ubuntu build runners and GitHub pages. For our internal content, we have private build runners (built identical to Ubuntu GitHub build runners), that deploy to VM behind a firewall.

I will do some testing on our public content with relative links turned on tomorrow to see if this might be a problem limited to our internal environment setup and report back. Thanks again.

Thanks @bnorberg, going in the right direction.

Are you able to reproduce the failure on https://github.com/NetAppDocs/storagegrid-116/tree/gh-pages with a similar configuration, of course without ignoring ../?

If you can provide us with another gist, which you could use at your end to reproduce the issue on https://github.com/NetAppDocs/storagegrid-116/tree/gh-pages, we then have a starting point to reproduce at our end and investigate. Fort this exercise, you may want to specify disable_external: true and run with only checks: ['Links'] to speed things up.

While you do that, it would be great if you could check the error also occurs with the latest v5 html-proofer.

@gjtorikian, @bnorberg, I had an intuition by looking at the way the error is reported in the issue's description at the very top: html-proofer is actually looking for a link without leading ../, paired with my special interest for the html-proofer caching.

With this in mind, I could reproduce the issue even with the same test data introduced in bd89111 by just enabling the cache.

The issue is with cleaned_url, which was introduced in #686 consistently for the external URLs

def retrieve_urls(urls_detected, type)
# if there are no urls, bail
return {} if urls_detected.empty?
urls_detected = urls_detected.transform_keys do |url|
cleaned_url(url)
end
urls_to_check = detect_url_changes(urls_detected, type)
urls_to_check
end

however introducing an inconsistency for the internal URLs, where cleaning is not done when adding to the cache, and should not be done as it seems to strip leading ../.

I have extended the test and am ready for a simple fix

Amazing sleuthing @riccardoporreca, thank you so much.

Released as 5.0.5.

@gjtorikian, @riccardoporreca I created a branch with the latest html proofer release and can confirm that this change does solve our relative link error issue. Thank you so much for your effort and amazing response time. You gentlemen are awesome!