Response data may be incorrectly decoded (not UTF-8)

Question

ripperdoc opened this issue 4 years ago · comments

So this is in the weeds of encoding, let's see if I can try to make sense of this.

I'm using jsonrpcclient on an API that in a few cases returns data that has mangled encoding (e.g. characters like ä appear as Korean).
The API returns a response header like this:

access-control-allow-origin: *
cache-control: no-cache
content-length: 2458
content-type: application/json
date: Sun, 27 Dec 2020 23:01:42 GMT
server: Cowboy

Note that there is no charset=utf-8 in the content-type, while JSON tends to be assumed to be UTF-8.

The problem, I believe, is in client.py, line 173:

response.data = parse(
    response.text, batch=batch, validate_against_schema=validate_against_schema
)

This line parses the data using response.text. According to the requests documentation, the text property is decoded using the charset given in the response headers, but if no charset is given, requests guesses the encoding using the chardet library. chardet can guess wrong. In my case, it guessed EUC-KR for some responses (there is nothing obvious in the returned data that tells me why that guess is made).
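The mangling described above can be reproduced without any HTTP traffic. The sketch below uses only the standard library and a made-up payload (not data from the actual API): it encodes JSON as UTF-8 and then decodes the same bytes as EUC-KR, which is what a wrong guess amounts to.

```python
import json

# Hypothetical payload, invented for illustration only.
payload = {"name": "Hälsa"}

# What the server puts on the wire: JSON encoded as UTF-8.
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# What a client gets if it decodes those bytes with a wrong guess.
mangled = json.loads(raw.decode("euc_kr"))

# What it gets when the same bytes are decoded as UTF-8.
correct = json.loads(raw.decode("utf-8"))

print(mangled["name"])  # the two UTF-8 bytes of "ä" become one Hangul character
print(correct["name"])
```

The bytes are identical in both cases; only the codec chosen for decoding differs.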

The way to solve this, according to requests, is to set the encoding property of the response object manually before reading the text property. But jsonrpcclient does not expose that possibility, because the parsing happens internally in the send method. I would also argue that it is a better approach to parse the JSON directly from the bytes in response.content rather than from response.text, which solves my problem and makes the decoding deterministic.
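A minimal sketch of the bytes-based approach, assuming the body is UTF-8. The helper name parse_body and the sample body are hypothetical and not part of jsonrpcclient; in real use the bytes would come from response.content.

```python
import json


def parse_body(content: bytes):
    """Decode a JSON-RPC response body as UTF-8 and parse it.

    Hypothetical helper, not jsonrpcclient's actual code.
    """
    # "utf-8-sig" behaves like "utf-8" but also tolerates a leading BOM.
    return json.loads(content.decode("utf-8-sig"))


# Stand-in for response.content (made-up data).
body = '{"jsonrpc": "2.0", "result": "Hälsa", "id": 1}'.encode("utf-8")
parsed = parse_body(body)
print(parsed["result"])
```

Because the codec is fixed, the result no longer depends on what an encoding detector guesses for a particular response.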

Beau · Answer 1 · Mon Dec 28 2020 11:47:12 GMT+0800 (China Standard Time)

Thanks for reporting. I can see two possible solutions:

We add a response_encoding param to let you specify that.
Another, more flexible option is to add a callback param. The callback takes the requests.Response object and lets you do whatever you need, returning a string (the decoded text). The default callback just returns response.text.
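The callback option could look roughly like the sketch below. Everything here is an assumption for illustration: StubResponse stands in for requests.Response so the example runs without the network, and parse_response, default_decoder and utf8_decoder are hypothetical names, not jsonrpcclient's API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StubResponse:
    """Stand-in for requests.Response with only the fields used here."""

    content: bytes
    encoding: Optional[str] = None

    @property
    def text(self) -> str:
        # Simplification: the real property guesses an encoding when none
        # is set. Latin-1 plays the part of a wrong guess in this stub.
        return self.content.decode(self.encoding or "latin-1")


def default_decoder(response) -> str:
    # Default behaviour: trust whatever response.text produces.
    return response.text


def utf8_decoder(response) -> str:
    # User-supplied callback: always decode the raw bytes as UTF-8.
    return response.content.decode("utf-8")


def parse_response(response, decoder: Callable = default_decoder):
    # The callback turns the response into text; parsing happens afterwards.
    return json.loads(decoder(response))


stub = StubResponse('{"result": "Hälsa"}'.encode("utf-8"))
print(parse_response(stub)["result"])                # mangled by the stub's guess
print(parse_response(stub, utf8_decoder)["result"])  # decoded as UTF-8
```

A callback covers the response_encoding case as well, since a user can set response.encoding inside it before returning response.text.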

Let me know what you think.

Something that might help you for now - the jsonrpcclient response does have a raw attribute which gives you access to the requests.Response, so you should be able to decode the response yourself. However, jsonrpcclient will still validate and log the incorrectly decoded text.

Martin Frojd · Answer 2 · Tue Dec 29 2020 01:50:05 GMT+0800 (China Standard Time)

I'm not sure you need to add additional complexity to your library. The JSON standard seems pretty clear that it should always be encoded as UTF, which should apply to JSON-RPC as well. If you agree, that should mean you don't read response.text naively when doing your parsing and validation, but that you explicitly decode it as UTF (either 8, 16 or 32). The simplest way for you might be to use the response.json() method instead, or the underlying Python json.loads(), which should automatically decode any of those UTF formats.
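The json.loads() behaviour mentioned here can be checked with the standard library alone. Since Python 3.6, json.loads accepts bytes and detects UTF-8, UTF-16 or UTF-32 by itself. The document and the list of codecs below are made up for the demonstration.

```python
import json

# Made-up JSON document containing a non-ASCII character.
document = '{"result": "Hälsa"}'

# Variants with and without a byte-order mark.
codecs = ["utf-8", "utf-8-sig", "utf-16", "utf-16-le", "utf-32", "utf-32-be"]

results = {}
for codec in codecs:
    raw = document.encode(codec)
    # No encoding is passed: json.loads inspects the leading bytes itself.
    results[codec] = json.loads(raw)

for codec, value in results.items():
    print(codec, value["result"])
```

No guessing library is involved, so the same bytes always parse to the same result.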

Beau · Answer 3 · Tue Dec 29 2020 10:28:04 GMT+0800 (China Standard Time)

By that logic, the requests library should've determined the encoding based on the content-type: application/json header in your response.

To me it makes sense to let requests determine the encoding, but allow the user to override it like:

response = request("http://fruits.com", "get", response_encoding="utf-8")

Which sets the response.encoding attribute before accessing response.text.

Martin Frojd · Answer 4 · Tue Dec 29 2020 16:10:31 GMT+0800 (China Standard Time)

Very well, your choice. As a user, I'm happy as long as I can get the encoding right. Thanks!

Beau · Answer 5 · Thu Aug 19 2021 19:56:06 GMT+0800 (China Standard Time)

Won't be a problem in v4.