schemathesis / schemathesis

Supercharge your API testing, catch bugs, and ensure compliance

Home Page: https://schemathesis.readthedocs.io

[BUG] JSON parsing of responses containing goofy Unicode characters fails

strangelydim opened this issue

Checklist

  • I checked the FAQ section of the documentation
  • I looked for similar issues in the issue tracker
  • I am using the latest version of Schemathesis

Describe the bug

JSON parsing of responses containing funky Unicode characters fails.

To Reproduce

🚨 Mandatory 🚨: Steps to reproduce the behavior:

A (valid) JSON response that has some funky Unicode characters in it gets mangled by the 'response.text' property of the Python requests library and then fails to parse. Here's an example response that trips up the parsing:

{"detail":"multiple errors encountered: Error at "/grant/0/resource": Error at "/name": property "name" is missing\nSchema:\n {\n "properties": {\n "name": {\n "minLength": 1,\n "type": "string"\n }\n },\n "required": [\n "name"\n ],\n "type": "object"\n }\n\nValue:\n {\n "direct_permissions": [],\n "type": "queue",\n "��áå𑄉": null\n }\n","status":400,"title":"Bad Request"}

To GET that particular response real quick in case GitHub doesn't handle the characters well either:

https://echoserver.dev/server?response=N4IgFgpghgJhBOBnEAuA2mkBhA9gOwBcJCBaAFQE8AHCEAGhCiqoBsBLAYygLfwHoq8HACMWEALYBqAFaJ8IALoKGwnDAqpQBarRQgiADwL0QMblFQhgAHVMQCUNi1spb4gK4serCAAIEQkj+eBw47oQIEDAovgCi8IG+3L7WqSB8AObwUIR8AAx88BBy7vAcEKkucQk48EkEKWl8eFDiFWkxgjg08NqNti1tlSC+bIi+4mOIbHgZqXgAyhyQ4lCu1ni+vjYbW1vDXT08xcMxO5t7+2mD7VXnl5fDk3gAMsQZBGCnvgCMdPMPPbDbQ0b7DRAEeAzOZpAEPAC+cK28P+uyBaSKAEd3GwijBvmgkeiBq1biAicoicCdGC0iJpBAOARhnDERt5gA1KAsdwQdYXe7E0y4xkEAD6PUmiGm+EQBMpaKuthBZJiw2xEF5w1RFyVIAAgfqAIcAU8AIBuASF3vnhPM40Wy8LZ-iAIdx3HLUAAWPJ5Z08AhiFy2ABCsF8ACUIBqIbZ4SAUS6HAR3agAEw++FAA

Please include a minimal API schema causing this issue:

This particular example is just a standard Problem JSON response, à la: https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml

Proposed Fix

Apologies for not just making a PR; I just don't have the time, and it's a very, very small fix. For me, everything works just fine if I change this line:

return json.loads(response.text)

to:

return json.loads(response.content)

instead, so that Python requests doesn't attempt any character encoding conversion on the response before it is parsed as JSON.
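
To illustrate why the two calls can behave differently, here is a minimal sketch that uses only the standard library. It assumes (this is not confirmed anywhere in this thread) that the text was decoded with a wrongly guessed charset. The payload and the Latin-1 and UTF-16 guesses are made up for illustration. Since Python 3.6, json.loads accepts bytes and detects UTF-8, UTF-16 or UTF-32 on its own, so no charset guess is involved when the raw body is passed in.

```python
import json

# Hypothetical payload: a JSON object whose key contains non-ASCII characters,
# including U+0081 (a C1 control character), serialized as UTF-8 bytes.
original = {"\u0081áå": None, "status": 400}
body = json.dumps(original, ensure_ascii=False).encode("utf-8")

# Parsing the raw bytes: json.loads detects the UTF-8 encoding itself.
from_bytes = json.loads(body)

# Assumed bad guess no. 1: Latin-1. Decoding never fails, the JSON still
# parses, but the key comes back as mojibake.
from_latin1 = json.loads(body.decode("latin-1"))

# Assumed bad guess no. 2: UTF-16. errors="replace" mirrors the lenient
# decoding that requests uses for response.text; the result is no longer JSON.
wrong_text = body.decode("utf-16", errors="replace")
try:
    from_utf16 = json.loads(wrong_text)
except json.JSONDecodeError as exc:
    from_utf16 = exc

print(from_bytes == original)    # bytes round-trip unchanged
print(from_latin1 == original)   # silently different data
print(type(from_utf16).__name__)
```

If something like this is what happened, printing response.encoding and response.apparent_encoding for the failing response should show which charset requests actually picked.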

Hi! Thank you for reporting and providing the context! I’ll take a look at it today or tomorrow

What is your Python version?

Could you also please post the exact error that happened inside Schemathesis?

From the response you shared, I see that the first two characters are the UTF-8 representation of the U+0081 Unicode code point, which is a control character with no printable representation. At the display level it is therefore common to see U+FFFD (�) instead, which is exactly how it is rendered in your comment.

I assume that the .text call first decodes those bytes as UTF-8 and then takes the printable representation of the string (with �) rather than the actual string (with \u0081), whereas you expect to get the actual string, e.g. in checks. Is that roughly what is happening?
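
The points about U+0081 and U+FFFD can be checked with the standard library alone. This is only a sketch of the character-level behaviour, not of what Schemathesis or requests do internally; the variable names are illustrative.

```python
import unicodedata

# The two bytes C2 81 are the UTF-8 encoding of the code point U+0081.
raw = b"\xc2\x81"
ch = raw.decode("utf-8")

# U+0081 is in the "Cc" (control) category and is not printable.
category = unicodedata.category(ch)
printable = ch.isprintable()

# A lone 0x81 byte is not valid UTF-8; lenient decoding substitutes U+FFFD,
# the replacement character that renderers also show for unprintable input.
replaced = b"\x81".decode("utf-8", errors="replace")

print(repr(ch), category, printable, repr(replaced))
```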

At the moment I can't reproduce it:

In [20]: json.loads(r.content) == json.loads(r.text)
Out[20]: True

My requests version is 2.28.1 and urllib3 is 1.26.14

Hi @strangelydim

Does my comment make sense? I'd be happy to dig deeper if I had some more info to help me reproduce the issue.