gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.

HTMLProofer times out

asbjornu opened this issue · comments

After upgrading to HTMLProofer v4, it has stopped working for our Jekyll-built site. As can be seen in the following build, the build step is canceled before completion, after having run for 6 hours. It halts after the following lines have been logged:

ETHON: performed MULTI
Checking 3341 internal links

I would have run HTMLProofer on the command line if before_request was available there, but as that requires HTMLProofer to be run from Ruby, I've written the following little wrapper class (abbreviated here for brevity):

class Verifier
  def verify(path)
    proofer = HTMLProofer.check_directory(path, options)
    proofer.before_request { |request| before_request(request) }
    proofer.run
  end

  private

  def before_request(request)
    uri = URI(request.base_url)

    return unless uri.host.match('github\.(com|io)$')

    auth = "Bearer #{@auth_token}"
    request.options[:headers]['Authorization'] = auth
  end

  def options
    {
      cache: { timeframe: { external: '1w' } },
      check_html: true,
      check_unrendered_link: true,
      checks: %w[Links Images Scripts UnrenderedLink],
      enforce_https: true,
      log_level: @log_level,
      only_4xx: true,
      parallel: { in_processes: Concurrent.processor_count },
      url_ignore: [
        'https://api.payex.com/',
        'http://www.wikipedia.org',
        'http://restcookbook.com/Basics/loggingin/'
      ]
    }
  end
end

If you can spot anything erroneous about the above code, I would highly appreciate any pointers. To reproduce the problem, clone developer.swedbankpay.com and then run:

# First build the site with Jekyll. Use Docker so Java dependencies, etc., don't have to be installed locally.
# Ignore the version of HTMLProofer being installed here, it's inside the Docker container and unrelated
# to the tests executed with Rake.
docker compose run portal build --site-url=https://developer.swedbankpay.com 
bundle exec rake

If there's anything I can do to help debug this issue, please let me know!

@asbjornu, I was curious about this, as I fear you might be hitting performance limitations in the internal link checks, due to a bottleneck I looked into a while ago.

I investigated by downloading the build-site artifact and running HTMLProofer on it locally.

I could easily run it on index.html alone via

htmlproofer --disable-external true --log-level debug --checks Links  build-site/index.html --root build-site/

which took upwards of 5 minutes (on my laptop) but completed. Given the hundreds of files, it is not hard to imagine that you are genuinely hitting the 6-hour timeout. Based on previous investigations, I know a key bottleneck is the create_nokogiri call made when checking whether a hash exists:

html = create_nokogiri(url.absolute_path)

I could confirm this by running on the entire site without checking internal hashes:

htmlproofer --disable-external true --log-level debug --checks Links  build-site/ --check-internal-hash false

where the check for internal links now completes in a reasonable time of ~4 minutes.

It might be worth trying check_internal_hash: false at your end to confirm this and at least have something working.

The issue could be mitigated by minimizing the repeated create_nokogiri for the same target internal page, since this is currently re-done for each different hash within the same page, and for the same hash linked from different pages.
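To illustrate the idea, here is a minimal sketch of caching the fragments per target page, so that each page is parsed at most once however many hashes point into it. FragmentCache, its extractor block and parse_count are hypothetical names made up for this example, and the regex-based id extraction is only a stand-in for the real Nokogiri parsing; this is not the code HTMLProofer uses.

```ruby
require 'set'

# Hypothetical helper, not part of HTMLProofer: remembers which fragment
# identifiers each target page defines, so a page is parsed only once.
class FragmentCache
  attr_reader :parse_count

  # The block receives a path and returns the ids defined in that page.
  # In a real setup it would parse the file; here it is injected.
  def initialize(&extractor)
    @extractor = extractor
    @fragments = {}
    @parse_count = 0
  end

  # True if the page at `path` defines the given fragment.
  def hash_exists?(path, fragment)
    fragments_for(path).include?(fragment)
  end

  private

  # Runs the extractor on the first lookup of a path, then reuses the result.
  def fragments_for(path)
    @fragments.fetch(path) do
      @parse_count += 1
      @fragments[path] = Set.new(@extractor.call(path))
    end
  end
end

# Stand-in for files on disk (assumption: ids are written as id="...").
pages = { 'a.html' => '<h1 id="intro">Intro</h1><h2 id="usage">Usage</h2>' }
cache = FragmentCache.new { |path| pages.fetch(path).scan(/id="([^"]+)"/).flatten }

cache.hash_exists?('a.html', 'intro')   # first lookup parses the page
cache.hash_exists?('a.html', 'usage')   # served from the cache
cache.hash_exists?('a.html', 'missing') # served from the cache
puts cache.parse_count
```

Three lookups against the same page cost a single parse here. The trade-off is memory: the set of ids for every visited page stays around for the duration of the run.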

@gjtorikian, happy to help out here and draft something

Would it be possible to try out the repo build using this branch? #766

What would be the easiest way to execute HTMLProofer from the command line on the native-async branch, @gjtorikian? For completeness, I ran bundle exec htmlproofer _site on the command line and it also hangs. Canceling it, I get the following stack trace:

Traceback (most recent call last):
	40: from /Users/bitbear/gems/bin/bundle:23:in `<main>'
	39: from /Users/bitbear/gems/bin/bundle:23:in `load'
	38: from /Users/bitbear/gems/gems/bundler-2.3.22/exe/bundle:36:in `<top (required)>'
	37: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/friendly_errors.rb:120:in `with_friendly_errors'
	36: from /Users/bitbear/gems/gems/bundler-2.3.22/exe/bundle:48:in `block in <top (required)>'
	35: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli.rb:25:in `start'
	34: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
	33: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli.rb:31:in `dispatch'
	32: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
	31: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	30: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
	29: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli.rb:486:in `exec'
	28: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli/exec.rb:23:in `run'
	27: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli/exec.rb:58:in `kernel_load'
	26: from /Users/bitbear/gems/gems/bundler-2.3.22/lib/bundler/cli/exec.rb:58:in `load'
	25: from /Users/bitbear/gems/bin/htmlproofer:25:in `<top (required)>'
	24: from /Users/bitbear/gems/bin/htmlproofer:25:in `load'
	23: from /Users/bitbear/gems/gems/html-proofer-4.4.0/bin/htmlproofer:11:in `<top (required)>'
	22: from /Users/bitbear/gems/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
	21: from /Users/bitbear/gems/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
	20: from /Users/bitbear/gems/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
	19: from /Users/bitbear/gems/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
	18: from /Users/bitbear/gems/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
	17: from /Users/bitbear/gems/gems/html-proofer-4.4.0/bin/htmlproofer:97:in `block (2 levels) in <top (required)>'
	16: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/runner.rb:46:in `run'
	15: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/runner.rb:95:in `check_files'
	14: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/runner.rb:145:in `validate_internal_urls'
	13: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:19:in `validate'
	12: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:26:in `run_internal_link_checker'
	11: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:26:in `each_pair'
	10: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:27:in `block in run_internal_link_checker'
	 9: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:27:in `each'
	 8: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:40:in `block (2 levels) in run_internal_link_checker'
	 7: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:79:in `hash_exists?'
	 6: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/url_validator/internal.rb:92:in `find_fragments'
	 5: from /Users/bitbear/gems/gems/html-proofer-4.4.0/lib/html_proofer/utils.rb:22:in `create_nokogiri'
	 4: from /Users/bitbear/gems/gems/nokogiri-1.13.8-x86_64-darwin/lib/nokogiri/html5.rb:31:in `HTML5'
	 3: from /Users/bitbear/gems/gems/nokogiri-1.13.8-x86_64-darwin/lib/nokogiri/html5/document.rb:43:in `parse'
	 2: from /Users/bitbear/gems/gems/nokogiri-1.13.8-x86_64-darwin/lib/nokogiri/html5/document.rb:85:in `do_parse'
	 1: from /Users/bitbear/gems/gems/nokogiri-1.13.8-x86_64-darwin/lib/nokogiri/html5/document.rb:85:in `parse'
/Users/bitbear/gems/gems/nokogiri-1.13.8-x86_64-darwin/lib/nokogiri/xml/document.rb:172:in `initialize': Interrupt
proofer _site(45009,0x114210e00) malloc: *** error for object 0x7f926073ef90: pointer being freed was not allocated
proofer _site(45009,0x114210e00) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

@riccardoporreca, with check_internal_hash: false, HTMLProofer now completes the check. 🎉 Thanks! 🙏🏼

@gjtorikian, I investigated a possible approach to minimize the time-consuming create_nokogiri calls and have a possible working solution in mind. I need to clean it up a bit but will try to draft a PR soon.

See the open PR #770 for the proposed approach, including a link to output generated locally with the proposed solution, showing its effectiveness on the build-site artifacts from @asbjornu.

Thanks to @riccardoporreca, this has now been optimized in 4.4.1.

Thank you so much @riccardoporreca! 🙏🏼 With check_internal_hash: true, HTMLProofer now completes in 24 minutes instead of infinity. So definitely an improvement! 👏🏼 But still a far cry from the 4 minutes it takes with check_internal_hash: false. 🤔