[BUG] JSON parsing of responses containing goofy Unicode characters fails
strangelydim opened this issue
Checklist
- I checked the FAQ section of the documentation
- I looked for similar issues in the issue tracker
- I am using the latest version of Schemathesis
Describe the bug
JSON parsing of responses containing funky Unicode characters fails.
To Reproduce
🚨 Mandatory 🚨: Steps to reproduce the behavior:
A (valid) JSON response that has some funky Unicode characters in it gets screwed up by the 'response.text' property of Python requests and then fails to parse. Here's an example response that trips up the parsing:
{"detail":"multiple errors encountered: Error at "/grant/0/resource": Error at "/name": property "name" is missing\nSchema:\n {\n "properties": {\n "name": {\n "minLength": 1,\n "type": "string"\n }\n },\n "required": [\n "name"\n ],\n "type": "object"\n }\n\nValue:\n {\n "direct_permissions": [],\n "type": "queue",\n "��áå𑄉": null\n }\n","status":400,"title":"Bad Request"}
To GET that particular response real quick in case GitHub doesn't handle the characters well either:
Please include a minimal API schema causing this issue:
This particular example is just a standard Problem JSON response, à la: https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml
Proposed Fix
Apologies for not just making a PR, just don't have the time and it's a very, very small fix... For me, everything works just fine if I change this line:
to:
return json.loads(response.content)
instead, so Python requests doesn't try to do any character code conversion on the response before parsing it as JSON.
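A minimal sketch of why parsing the raw bytes can behave differently from parsing decoded text. It does not use requests at all: the payload is hypothetical, and the Latin-1 decode only stands in for "some wrong encoding guess", which is an assumption about the failure mode and not something confirmed in this issue. The one certain part is that json.loads accepts bytes and detects UTF-8, UTF-16, or UTF-32 on its own.

```python
import json

# Hypothetical payload: a key made of non-ASCII characters, including U+0081.
original = {"\u0081áå": None, "status": 400}

# What a server would put on the wire: UTF-8 encoded JSON bytes.
raw = json.dumps(original, ensure_ascii=False).encode("utf-8")

# json.loads detects the encoding of bytes input itself.
from_bytes = json.loads(raw)

# Simulate text produced by a wrong encoding guess (Latin-1 is an
# illustrative assumption, not what requests necessarily picks).
from_wrong_text = json.loads(raw.decode("latin-1"))

print(from_bytes == original)
print(from_wrong_text == original)
```

In this sketch the mis-decoded text still parses, but the key no longer matches the original. Other wrong guesses could plausibly produce text that does not parse at all.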
Hi! Thank you for reporting and providing the context! I’ll take a look at it today or tomorrow.
What is your Python version?
And could you, please, post the exact error that happened inside Schemathesis?
From the response you shared, I see that the first two characters are the UTF-8 representation of the U+0081 Unicode code point, which is a control character that has no printable representation. On the representation level it is therefore common to see U+FFFD (�) instead, which is exactly how it is rendered in your comment.
I assume that the .text call first decodes those bytes as UTF-8, then takes the printable representation of the string (with �) and not the actual string (with \u0081), but your expectation is to have the actual string in, e.g., checks. Is that along the lines of what is happening?
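For reference, a small sketch of the facts about U+0081 mentioned above, using only the standard library. It shows the two-byte UTF-8 encoding, the control-character category, and how U+FFFD appears when invalid bytes are decoded with errors="replace". It is only an illustration of the code point's properties, not a reproduction of what requests or Schemathesis does.

```python
import unicodedata

ch = "\u0081"

# U+0081 takes two bytes in UTF-8.
encoded = ch.encode("utf-8")
print(encoded)  # b'\xc2\x81'

# It is a control character (category "Cc") and is not printable.
print(unicodedata.category(ch))  # Cc
print(ch.isprintable())  # False

# Decoding the same bytes as UTF-8 gives back the actual character.
round_trip = encoded.decode("utf-8")
print(round_trip == ch)  # True

# A lone continuation byte is invalid UTF-8; with errors="replace"
# the decoder substitutes U+FFFD, the replacement character.
replaced = b"\x81".decode("utf-8", errors="replace")
print(replaced == "\ufffd")  # True
```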
At the moment I can't reproduce it:
In [20]: json.loads(r.content) == json.loads(r.text)
Out[20]: True
My requests version is 2.28.1 and urllib3 is 1.26.14.
Does my comment make sense? I'd be happy to dig deeper if I had some more info to help me reproduce the issue.