HTTP 302 (to the same URL?) reported as failures
stevecheckoway opened this issue · comments
The following document, which links to a recent CNN tweet, shows the problem.
Here's the input file.
<!DOCTYPE html>
<a href='https://twitter.com/CNN/status/1688986037488398337'>X</a>
Here's the output.
$ htmlproofer /tmp/a.html
Running 3 checks (Images, Links, Scripts) in /tmp/a.html on *.html files ...
Checking 1 external link
Checking 0 internal links
Checking internal link hashes in 0 files
Ran on 1 file!
For the Links > External check, the following failures were found:
* At /tmp/a.html:2:
External link https://twitter.com/CNN/status/1688986037488398337 failed (status code 302)
HTML-Proofer found 1 failure!
Here's the curl output.
$ curl -i https://twitter.com/CNN/status/1688986037488398337
HTTP/2 302
date: Tue, 08 Aug 2023 19:01:40 GMT
perf: 7626143928
vary: Accept
server: tsa_p
location: /CNN/status/1688986037488398337
set-cookie: guest_id_marketing=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id_ads=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: personalization_id="v1_EkXMRmMQFQuZSli6TwF04A=="; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
set-cookie: guest_id=v1%3A169152130039495206; Max-Age=63072000; Expires=Thu, 07 Aug 2025 19:01:40 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
content-type: text/plain; charset=utf-8
x-powered-by: Express
cache-control: no-cache, no-store, max-age=0
content-length: 53
x-transaction-id: b969b434337adbe8
strict-transport-security: max-age=631138519
x-response-time: 14
x-connection-hash: 975bcbcbef786c98f324095488fd265dc5d224dec81a059544a8563b0b9c334f
Found. Redirecting to /CNN/status/1688986037488398337
If I had to guess, I'd say it's redirecting to the same location but setting cookies and the website is probably checking if the cookies are set. Indeed, with a little testing, this seems to be exactly what's happening.
If I configure curl to follow redirects (via -L), I get an infinite loop. If I tell curl to use a cookie jar and follow the redirects, it succeeds.
$ curl -iL -b cookiejar -c cookiejar https://twitter.com/CNN/status/1688986037488398337
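To make the mechanism concrete, here is a minimal Ruby sketch of the behavior described above. It makes no network requests: fake_server is a hypothetical stand-in for the site (it is not Twitter's real logic), and final_status is an illustrative helper, not part of htmlproofer or Typhoeus.

```ruby
# Hypothetical stand-in for the site: it redirects to the same path until
# the request carries the guest_id cookie that the redirect itself sets.
def fake_server(path, cookies)
  if cookies.key?('guest_id')
    { status: 200, headers: {} }
  else
    { status: 302, headers: { 'location' => path, 'set-cookie' => 'guest_id=v1; Path=/' } }
  end
end

# Follow redirects for at most max_hops requests. With keep_cookies: false,
# Set-Cookie headers are ignored, much like curl -L without a cookie jar.
def final_status(server, path, keep_cookies:, max_hops: 10)
  cookies = {}
  max_hops.times do
    response = server.call(path, cookies)
    return response[:status] unless response[:status] == 302

    if keep_cookies && (cookie = response[:headers]['set-cookie'])
      # Keep only the name=value pair, dropping attributes such as Path.
      name, value = cookie.split(';').first.split('=', 2)
      cookies[name] = value
    end
    path = response[:headers]['location']
  end
  :too_many_redirects
end
```

Calling final_status(method(:fake_server), '/CNN/status/1688986037488398337', keep_cookies: false) exhausts the hop limit, while the same call with keep_cookies: true gets through on the second request.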
There seem to be two approaches to dealing with this:
- If you get a 302 whose Location header points back to the same URL, treat the page as existing (although that won't work with hashes).
- Configure the HTTP client to use a cookie jar. Since a site is likely to have multiple links to the same pages, it seems reasonable to use the same cookie jar for all requests.
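The first approach could be sketched roughly as follows. This is only an illustration using Ruby's standard uri library; self_redirect? is a hypothetical helper, not something htmlproofer provides. As noted above, a check like this never fetches the page body, so it cannot validate hashes.

```ruby
require 'uri'

# True when a redirect's Location header resolves back to the requested URL.
def self_redirect?(request_url, location)
  return false if location.nil? || location.empty?

  # Location may be relative (as in the curl output), so resolve it first.
  target = URI.join(request_url, location)
  origin = URI.parse(request_url)

  # Fragments are never sent to the server, so ignore them when comparing.
  target.fragment = nil
  origin.fragment = nil
  target == origin
end
```

With the values from the curl output, self_redirect?('https://twitter.com/CNN/status/1688986037488398337', '/CNN/status/1688986037488398337') is true, while a Location pointing at a different path is not.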
I'm going to close this issue because I figured out that I, as the user, can configure htmlproofer to use a cookie jar.
From the command line, it is
$ htmlproofer --typhoeus '{ "followlocation": true, "cookiefile": "cookiejar.txt", "cookiejar": "cookiejar.txt" }' /tmp/a.html
From Ruby, the configuration is something like
{
  typhoeus: {
    followlocation: true,
    cookiefile: 'cookiejar.txt',
    cookiejar: 'cookiejar.txt'
  }
}
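For completeness, here is a rough sketch of how that hash might be passed to the library from a script. It assumes the html-proofer gem is installed and uses its HTMLProofer.check_file entry point; cookie_jar_options and check_with_cookie_jar are hypothetical helper names chosen for this example.

```ruby
# Build the options shown above: follow redirects, and read and write
# cookies from the same file so every request shares one jar.
def cookie_jar_options(jar_path = 'cookiejar.txt')
  {
    typhoeus: {
      followlocation: true,
      cookiefile: jar_path, # file cookies are read from
      cookiejar: jar_path   # file cookies are written to
    }
  }
end

# Check a single HTML file with the shared cookie jar.
# The gem is required lazily so the options helper works without it.
def check_with_cookie_jar(html_path, jar_path = 'cookiejar.txt')
  require 'html-proofer'
  HTMLProofer.check_file(html_path, cookie_jar_options(jar_path)).run
end
```

For the file from this issue, that would be check_with_cookie_jar('/tmp/a.html').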
It may be worth adding this information to the configuration section of the README.
PRs accepted. 😀