schemathesis / schemathesis

Checklist

I checked the FAQ section of the documentation
I looked for similar issues in the issue tracker
I am using the latest version of Schemathesis

Describe the bug

JSON parsing of responses containing funky Unicode characters fails.

To Reproduce

🚨 Mandatory 🚨: Steps to reproduce the behavior:

A (valid) JSON response that has some funky Unicode characters in it gets mangled when it is read through requests' 'response.text' and then fails to parse. Here's an example response that trips up the parsing:

{"detail":"multiple errors encountered: Error at "/grant/0/resource": Error at "/name": property "name" is missing\nSchema:\n {\n "properties": {\n "name": {\n "minLength": 1,\n "type": "string"\n }\n },\n "required": [\n "name"\n ],\n "type": "object"\n }\n\nValue:\n {\n "direct_permissions": [],\n "type": "queue",\n "��áå𑄉": null\n }\n","status":400,"title":"Bad Request"}

To GET that particular response real quick in case GitHub doesn't handle the characters well either:

https://echoserver.dev/server?response=N4IgFgpghgJhBOBnEAuA2mkBhA9gOwBcJCBaAFQE8AHCEAGhCiqoBsBLAYygLfwHoq8HACMWEALYBqAFaJ8IALoKGwnDAqpQBarRQgiADwL0QMblFQhgAHVMQCUNi1spb4gK4serCAAIEQkj+eBw47oQIEDAovgCi8IG+3L7WqSB8AObwUIR8AAx88BBy7vAcEKkucQk48EkEKWl8eFDiFWkxgjg08NqNti1tlSC+bIi+4mOIbHgZqXgAyhyQ4lCu1ni+vjYbW1vDXT08xcMxO5t7+2mD7VXnl5fDk3gAMsQZBGCnvgCMdPMPPbDbQ0b7DRAEeAzOZpAEPAC+cK28P+uyBaSKAEd3GwijBvmgkeiBq1biAicoicCdGC0iJpBAOARhnDERt5gA1KAsdwQdYXe7E0y4xkEAD6PUmiGm+EQBMpaKuthBZJiw2xEF5w1RFyVIAAgfqAIcAU8AIBuASF3vnhPM40Wy8LZ-iAIdx3HLUAAWPJ5Z08AhiFy2ABCsF8ACUIBqIbZ4SAUS6HAR3agAEw++FAA

Please include a minimal API schema causing this issue:

This particular example is just a standard Problem JSON response, à la: https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml

Proposed Fix

Apologies for not just making a PR, just don't have the time and it's a very, very small fix... For me, everything works just fine if I change this line:

schemathesis/src/schemathesis/transports/responses.py

Line 38 in cb0a779

return json.loads(response.text)

to:

return json.loads(response.content)

instead, so Python requests doesn't try to do any character code conversion on the response before parsing it as JSON.
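To illustrate why parsing the raw bytes can behave differently from parsing decoded text, here is a minimal standard-library sketch. It does not use requests or Schemathesis, and it is not a reproduction of the exact failure above: the payload and the ISO-8859-1 "wrong guess" are assumptions picked only to show the mechanism.

```python
import json

# Hypothetical payload: a key made of U+0081, U+00E1 and U+00E5
# (an assumption, not the reporter's exact bytes).
payload = {"\u0081\u00e1\u00e5": None}
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# json.loads accepts bytes and detects UTF-8/UTF-16/UTF-32 on its own.
from_bytes = json.loads(raw)

# Simulate a text layer that guessed the wrong charset before parsing.
# ISO-8859-1 is an assumption chosen for illustration.
wrong_text = raw.decode("iso-8859-1")
from_wrong_text = json.loads(wrong_text)

print(from_bytes == payload)       # True
print(from_wrong_text == payload)  # False
print([ascii(key) for key in from_wrong_text])
```

With a wrong charset guess the document may still parse, but the key no longer matches what the server sent. Other wrong guesses could make parsing fail outright.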

Hi! Thank you for reporting and providing the context! I’ll take a look at it today or tomorrow.

What is your Python version?

And could you please post the exact error that occurred inside Schemathesis?

From the response you shared, I see that the first two characters are the UTF-8 representation of the U+0081 Unicode code point, which is a control character without a visible representation, i.e. it is not printable. At the display level it is therefore common to see U+FFFD (�) instead, which is exactly how it is rendered in your comment.

I assume that the .text call first decodes those bytes as UTF-8 and then yields the printable representation of the string (with �) rather than the actual string (with \u0081), whereas you expect to get the actual string in e.g. checks. Is that along the lines of what is happening?
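For reference, the properties of U+0081 mentioned above can be checked with the standard library alone. This is only a sketch of the character-level facts; it says nothing about what requests or Schemathesis do with the response.

```python
import unicodedata

ch = "\u0081"

encoded = ch.encode("utf-8")          # two bytes: 0xC2 0x81
category = unicodedata.category(ch)   # "Cc" means control character
printable = ch.isprintable()          # control characters are not printable
shown = repr(ch)                      # repr escapes it instead of drawing a glyph

# Valid UTF-8 round-trips back to the original code point.
round_trip = encoded.decode("utf-8")

# A lone 0x81 byte is not valid UTF-8; with errors="replace" it becomes U+FFFD.
replaced = b"\x81".decode("utf-8", errors="replace")

print(encoded, category, printable, shown)
print(round_trip == ch, replaced == "\ufffd")
```

So a � on screen can come either from a renderer substituting a glyph for an unprintable character, or from a decoder that actually replaced invalid bytes. Only the second changes the string itself.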

At the moment I can't reproduce it:

In [20]: json.loads(r.content) == json.loads(r.text)
Out[20]: True

My requests version is 2.28.1 and urllib3 is 1.26.14

Hi @strangelydim

Does my comment make sense? I'd be happy to dig deeper if I had some more info that would help me reproduce the issue.

[BUG] JSON parsing of responses containing goofy Unicode characters fails