gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.

HTTP 302 (to the same URL?) reported as failures

stevecheckoway opened this issue

The following document, which links to a recent CNN tweet, shows the problem.

Here's the input file.

<!DOCTYPE html>
<a href='https://twitter.com/CNN/status/1688986037488398337'>X</a>

Here's the output.

$ htmlproofer /tmp/a.html
Running 3 checks (Images, Links, Scripts) in /tmp/a.html on *.html files ...


Checking 1 external link
Checking 0 internal links
Checking internal link hashes in 0 files
Ran on 1 file!


For the Links > External check, the following failures were found:

* At /tmp/a.html:2:

  External link https://twitter.com/CNN/status/1688986037488398337 failed (status code 302)


HTML-Proofer found 1 failure!

Here's the curl output.

$  curl -i https://twitter.com/CNN/status/1688986037488398337
HTTP/2 302
date: Tue, 08 Aug 2023 19:01:40 GMT
perf: 7626143928
vary: Accept
server: tsa_p
location: /CNN/status/1688986037488398337
set-cookie: guest_id_marketing=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id_ads=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: personalization_id="v1_EkXMRmMQFQuZSli6TwF04A=="; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
content-type: text/plain; charset=utf-8
x-powered-by: Express
cache-control: no-cache, no-store, max-age=0
content-length: 53
x-transaction-id: b969b434337adbe8
strict-transport-security: max-age=631138519
x-response-time: 14
x-connection-hash: 975bcbcbef786c98f324095488fd265dc5d224dec81a059544a8563b0b9c334f

Found. Redirecting to /CNN/status/1688986037488398337

If I had to guess, I'd say it's redirecting to the same location but setting cookies and the website is probably checking if the cookies are set. Indeed, with a little testing, this seems to be exactly what's happening.

If I configure curl to follow redirects (via -L), I get an infinite loop. If I tell curl to use a cookie jar and follow the redirects, it succeeds.

$ curl -iL -b cookiejar -c cookiejar https://twitter.com/CNN/status/1688986037488398337
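The behaviour described above can be reproduced without touching the network. The following is a minimal sketch, assuming the server redirects to the same path until it sees the cookie it handed out. `FAKE_SERVER` and `final_status` are hypothetical names introduced for illustration; they are not part of curl, Typhoeus, or html-proofer.

```ruby
require 'uri'

# Hypothetical stand-in for a server that answers 302 (to the same path)
# until the client sends back the cookie it was given.
FAKE_SERVER = lambda do |url, cookies|
  if cookies.key?('guest_id')
    { status: 200, headers: {} }
  else
    { status: 302,
      headers: { 'location' => URI(url).path, 'set-cookie' => 'guest_id=v1; Path=/' } }
  end
end

# Follow redirects for at most max_redirects requests. When use_cookie_jar is
# true, cookies from Set-Cookie are stored and sent with the next request.
def final_status(url, fetch, use_cookie_jar:, max_redirects: 5)
  jar = {}
  max_redirects.times do
    response = fetch.call(url, use_cookie_jar ? jar : {})
    if (cookie = response[:headers]['set-cookie'])
      name, value = cookie.split(';').first.split('=', 2)
      jar[name] = value
    end
    return response[:status] unless response[:status] == 302

    # Resolve a relative Location header against the current URL.
    url = URI.join(url, response[:headers]['location']).to_s
  end
  :too_many_redirects
end

url = 'https://twitter.com/CNN/status/1688986037488398337'
puts final_status(url, FAKE_SERVER, use_cookie_jar: false)
puts final_status(url, FAKE_SERVER, use_cookie_jar: true)
```

Without the jar, every request looks like a first visit, so the client keeps being redirected until it gives up. With the jar, the second request carries the cookie and the fake server answers 200.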

There seem to be two approaches to dealing with this:

  1. If you get a 302 whose Location header points to the same URL, treat the page as existing (although that won't work with hashes).
  2. Configure the HTTP client to use a cookie jar. Since a site is likely to have multiple links to the same pages, it seems reasonable to use the same cookie jar for all requests.
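The first approach could be sketched as follows, assuming the check only needs to compare the requested URL with the resolved Location header. `self_redirect?` is a hypothetical helper, not something html-proofer provides.

```ruby
require 'uri'

# Returns true when a redirect's Location header resolves to the URL that was
# just requested. Fragments are dropped before comparing, which is why this
# approach cannot say anything about hashes.
def self_redirect?(requested_url, location)
  requested = URI(requested_url)
  target = URI.join(requested_url, location) # handles relative Location values
  requested.fragment = nil
  target.fragment = nil
  requested.normalize.to_s == target.normalize.to_s
end

url = 'https://twitter.com/CNN/status/1688986037488398337'
puts self_redirect?(url, '/CNN/status/1688986037488398337')
puts self_redirect?(url, '/i/flow/login')
```

A checker using this rule would accept the 302 from the curl output above, because its relative Location resolves to the requested URL.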

I'm going to close this issue because I figured out that I, as the user, can configure whether htmlproofer uses a cookie jar.

From the command line, it is

$ htmlproofer --typhoeus '{ "followlocation": true, "cookiefile": "cookiejar.txt", "cookiejar": "cookiejar.txt" }' /tmp/a.html

From Ruby, the configuration is something like

{
  typhoeus: {
    followlocation: true,
    cookiefile: 'cookiejar.txt',
    cookiejar: 'cookiejar.txt'
  }
}
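To show how the Ruby hash and the command-line form relate, here is a small sketch that builds the `--typhoeus` argument from the same options. `TYPHOEUS_OPTIONS` and `htmlproofer_command` are names made up for this example. The commented-out call at the end assumes the html-proofer gem is installed and has not been run here.

```ruby
require 'json'
require 'shellwords'

# Typhoeus options taken from the configuration above.
TYPHOEUS_OPTIONS = {
  followlocation: true,
  cookiefile: 'cookiejar.txt',
  cookiejar: 'cookiejar.txt'
}.freeze

# Build the equivalent command line; --typhoeus takes the options as JSON,
# so the hash is serialized and shell-escaped.
def htmlproofer_command(path, typhoeus_options)
  ['htmlproofer', '--typhoeus', JSON.generate(typhoeus_options), path].shelljoin
end

puts htmlproofer_command('/tmp/a.html', TYPHOEUS_OPTIONS)

# With the html-proofer gem installed, the same hash should be usable from Ruby:
#   require 'html-proofer'
#   HTMLProofer.check_file('/tmp/a.html', { typhoeus: TYPHOEUS_OPTIONS }).run
```

Keeping the options in one hash means the CLI invocation and the Ruby call cannot drift apart.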

It may be worth adding this information to the configuration section of the README.

PRs accepted. 😀