seomoz / reppy

Modern robots.txt Parser for Python


brotli compression not supported?

cypherlou opened this issue · comments

Robots.fetch() silently fails when the Accept-Encoding header contains a compression algorithm (such as br) that is not supported.
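
To reproduce (URL illustrative; this assumes Robots.fetch forwards keyword arguments through to requests, which appears to be the case):

```python
from reppy.robots import Robots

# Advertise only brotli. Against a server that honours 'br', the body
# comes back in an encoding the client cannot decode, and as far as I
# can tell the fetch fails silently instead of raising -- our site
# tests then behave as though no robots.txt exists at all.
robots = Robots.fetch('https://example.com/robots.txt',
                      headers={'Accept-Encoding': 'br'})
```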

Under the hood, it uses requests. Does any of this help? https://github.com/kennethreitz/requests/issues/4525

It seems like with brotli installed and the newest urllib3 and requests, it should work.
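
For a quick sanity check of whether the environment can handle br at all (assuming urllib3 looks for an importable brotli module, which I believe it does):

```python
# Check whether a brotli decoder is importable; urllib3 only advertises
# and decodes 'br' when one of these packages is installed.
try:
    import brotli  # provided by the 'brotli' or 'brotlipy' package
    print('brotli decoding available')
except ImportError:
    print('no brotli decoder; br-encoded responses will fail')
```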

Still, I agree that it should not fail silently. I have the bandwidth to review PRs, but lack the power to merge them :-/

@dlecocq, I will give that a go - many thanks for the suggestion.

I solved the problem by removing br as an accepted encoding method, but thought I would mention it because this behaviour meant that our site tests suggested there was no robots.txt on servers that supported that encoding, while servers that elected to use something else returned successfully.
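
For anyone else hitting this, the workaround was roughly the following (URL illustrative; it assumes headers pass through to requests):

```python
from reppy.robots import Robots

# Only advertise encodings the client can actually decode, so the
# server never responds with brotli in the first place.
robots = Robots.fetch('https://example.com/robots.txt',
                      headers={'Accept-Encoding': 'gzip, deflate'})
```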

Apart from repairing the problem, which may well manifest itself elsewhere and so is worth doing, I think two things would help in this and other contexts: the ability to provide robots.txt content as a text string or a file (ideally both), and the ability to access the content once the module has it, regardless of which method was used to get it there (string, file or URL).
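
For the string case, something like this sketch is what I mean (if I'm reading the README right, Robots.parse may already be close to it):

```python
from reppy.robots import Robots

# Parse robots.txt content we already have in hand, no fetch involved.
content = '''
User-agent: *
Disallow: /private
'''
robots = Robots.parse('https://example.com/robots.txt', content)
print(robots.allowed('/private', 'my-bot'))  # False
```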

Are you saying if I did a PR it wouldn't be accepted, or that you can't do it yourself? I ask as I have development resource that might be able to help, but I don't want to apply it if the PR is going to sit languishing.

Yeah, it's a pretty terrible failure mode since it effectively ignores the presence of such robots.txt files. It looks like urllib3.HTTPResponse.CONTENT_DECODERS enumerates the supported decoders for us (https://github.com/urllib3/urllib3/blob/64e413f1b2fef86a150ae747f00aab0e2be8e59c/src/urllib3/response.py#L184), so we could reach down in there to always get the supported encodings.
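
Sketching that out (no instance needed, since CONTENT_DECODERS is a plain class attribute; the header name is the standard one):

```python
from urllib3.response import HTTPResponse

# Build an Accept-Encoding header from the decoders urllib3 actually
# supports; 'br' only appears here when a brotli package is installed.
supported = ', '.join(HTTPResponse.CONTENT_DECODERS)
headers = {'Accept-Encoding': supported}
```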

Re: robots.txt content as a string (or file), that seems reasonable to me. One mode of operation that's important to users of this is to be able to keep a compact representation of the parsed rules for a number of sites, dropping the raw content and rules for other bots as soon as the thing is parsed. As long as that remains possible, I don't think you'll find any pushback from Moz.
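
Concretely, the pattern that needs to keep working looks roughly like this (agent name and URL illustrative):

```python
from reppy.robots import Robots

robots = Robots.fetch('https://example.com/robots.txt')
# Keep only our agent's parsed rules; drop the Robots object, and with
# it the raw content and every other bot's rules.
agent = robots.agent('my-bot')
del robots
print(agent.allowed('/some/path'))
```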

I was just saying that I can help out with the reviewing, but I'm no longer at Moz so can't press the 'merge' button for you. @lindseyreno is who'd probably end up doing that.

If @dlecocq reviews it, I'll merge it :)

Thanks @dlecocq and @lindseyreno, really great to get such speedy feedback. I'll talk to one of the lead devs and see if we can get this looked at. They might not want to prioritise it, as there's a workaround for the br encoding failure, but we'll try our best.