eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Error in .deserialize_json() - Embedded NUL in string.

GentleGhostCoder opened this issue

Hello everybody,
the JSON parser unfortunately has problems with the Unicode NUL character.

RcppSimdJson::fparse('{"text":"\u0000"}')

Comparison with jsonlite (works):
jsonlite::fromJSON('{"text":"\u0000"}')

Best regards,
Semjon Geist

It is different here; both fail:

> jsonlite::fromJSON('{"text":"\u0000"}')
Error: nul character not allowed (line 1)
> 

What OS / platform are you on? For me it is Ubuntu 20.10, "everything current" so jsonlite at 1.7.2.

My mistake, it needs two backslashes.

> jsonlite::fromJSON('{"text":"\\u0000"}')
$text
[1] ""

> 

Note that simdjson will happily accept the input...

❯ cat x.json  
{"text":"\u0000"}
❯ ./build/tools/json2json x.json        
{"text":"\u0000"}

Us too, from file:

> RcppSimdJson::fload("/tmp/issue68.json")
$text
[1] "\\u0000"

> 
> str(RcppSimdJson::fload("/tmp/issue68.json"))
List of 1
 $ text: chr "\\u0000"
> 

Hi, I'm sorry, I meant that with a double backslash. (GitHub automatically renders it with only one.)

> RcppSimdJson::fparse('{"text":"\\u0000"}')
Error in .deserialize_json(json = json, query = query, empty_array = empty_array,  : 
  Embedded NUL in string.
> jsonlite::fromJSON('{"text":"\\u0000"}')
$text
[1] ""

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] UPI.etl_1.0.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 knitr_1.31 configr_0.3.5 xml2_1.3.2 magrittr_2.0.1 aws.s3_0.3.21
[7] RcppTOML_0.1.7 aws.signature_0.6.0 R6_2.5.0 rlang_0.4.10 stringr_1.4.0 httr_1.4.2
[13] iotools_0.3-1 tools_3.6.3 parallel_3.6.3 packrat_0.5.0 data.table_1.13.6 xfun_0.20
[19] DBI_1.1.0 htmltools_0.5.1.1 RPostgreSQL_0.6-2 yaml_2.2.1 digest_0.6.27 ini_0.3.1
[25] base64enc_0.1-3 curl_4.3 mime_0.9 glue_1.4.2 evaluate_0.14 RcppSimdJson_0.1.3
[31] rmarkdown_2.6 stringi_1.5.3 compiler_3.6.3
[37] jsonlite_1.7.2

The "Embedded NUL in string" error comes from R.

JSON allows null characters inside strings (though they must be escaped). The following is perfectly valid JSON: "\u0000fdsfdsd\u0000".

Here is the specification:

All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

The string "\u0000" has exactly one character (the nul character).

Not all programming languages can represent a string with a null character in it. In C, you can't have a null character in a string.

I do not know how strings work in R... but it is possible that they can't have null characters in them. If that's the case, then there is no good solution: you can truncate the string, you can give an error... but you cannot represent the string "\u0000fdsfdsd\u0000".

People should not use nulls in their strings if they expect interoperability (with C). The simdjson library does fully support nulls in strings (although we only support it because, annoyingly, the specification says that we must).
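The mismatch described above can be seen with nothing but the C++ standard library: `std::string` carries an explicit length and keeps embedded NULs, while anything that goes through a NUL-terminated `char*` stops at the first one. A minimal sketch; the helper names `decoded_example`, `length_as_cpp_string` and `length_as_c_string` are made up for illustration and are not part of simdjson or RcppSimdJson:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Build the decoded value of the JSON string "\u0000fdsfdsd\u0000":
// a NUL, the seven letters "fdsfdsd", and another NUL (9 chars in total).
inline std::string decoded_example() {
    // The explicit length is required; without it the constructor would
    // treat the literal as a C string and stop at the first NUL.
    return std::string("\0fdsfdsd\0", 9);
}

// Length as std::string sees it: every char counts, NULs included.
inline std::size_t length_as_cpp_string(const std::string& s) {
    return s.size();
}

// Length as C sees it: strlen() stops at the first NUL.
inline std::size_t length_as_c_string(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For this input `length_as_cpp_string` reports 9 and `length_as_c_string` reports 0, which is the gap any binding into a NUL-terminated string type has to deal with.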

It is just the interactive part; as we showed, loading from a file works fine. Here is a variant (using R 4.0.4, @semmjon cannot do that with 3.6.3) with raw strings:

> RcppSimdJson::fparse(r"({"text":"\\u0000"})")
$text
[1] "\\u0000"

> 

That said, as a raw string I should need only one escaping backslash. If I try that, I still get a wet towel:

> RcppSimdJson::fparse(r'({"text":"\u0000"})')
Error in .deserialize_json(json = json, query = query, empty_array = empty_array,  : 
  Embedded NUL in string.
> 

@eddelbuettel

It is important to define the problem. Can R strings be made of a single character, the nul character? (String of length 1)

If so, then this should work and return "\0". If not, then this JSON input cannot work.

It is important to distinguish between two inputs: the string of length six, or the string of length 1 consisting of a null character.
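The two candidates can be written down explicitly. The sketch below uses only the C++ standard library, and the function names are hypothetical. The first function builds the six-character text (backslash, `u`, and four zeros); the second builds the one-character string that a JSON parser yields after decoding the escape.

```cpp
#include <cassert>
#include <string>

// The six-character text: backslash, 'u', '0', '0', '0', '0'.
// In C++ source the backslash itself must be escaped, hence "\\u0000".
inline std::string six_character_text() {
    return std::string("\\u0000");
}

// The one-character string: a single NUL, built with an explicit count
// so that no NUL-terminated literal is involved.
inline std::string decoded_nul() {
    return std::string(1, '\0');
}
```

Here `six_character_text().size()` is 6 and `decoded_nul().size()` is 1. Only the second is what the JSON document `{"text":"\u0000"}` actually contains.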

As I recall, yes, I think so, but I am really in the middle of two other things.

A quick grep in the local sources (which everybody can also search on GitHub, e.g. at https://github.com/r-devel/r-svn) reveals that one may not. From Quote.Rd:

## nul characters (for terminating strings in C) are not allowed (parse errors)
\dontrun{% as above, these errors cannot be caught via try*(..)
  "foo\0bar"     # Error: nul character not allowed (line 1)
  "foo\u0000bar" # same error
}

So I will give you a firm "unsure". The other hits of the error string above are in src/main/gram.{y,c}.

@eddelbuettel If I look at @semmjon 's result...

> jsonlite::fromJSON('{"text":"\\u0000"}')
$text
[1] ""

That looks wrong to me. It looks like it returns the empty string. But there is no universe in which "\u0000" is the empty string within JSON. It is either a string with 6 characters, or else the string with one character, the null character.

Yes, there may be several things going on here at once. First off, I think you were right to point out the basics. So "\u...." (using single backslash u) is an error. What we then did with "\\u...." (using double backslash u) may just be a different string.

If you want to file a bug report against package jsonlite, you will find it here: https://github.com/jeroen/jsonlite

@semmjon : A more fundamental issue seems to be that

> a <- "\u0000"
Error: nul character not allowed (line 1)
> 

also fails.

Where did your initial example come from? A file? A URL? Maybe the assumption that it should work in R is not the right one?

JSON comes from JavaScript where you have the following...

> "f\u0000fd".length
4

Java, C# and friends will also happily take "f\u0000fd".

It seems that in C and R, the nul character is not allowed so they won't allow the string "f\u0000fd".

What does jsonlite do? Maybe we can experiment a little...

jsonlite::fromJSON('{"text":"f123\\u0000bfdssfsd"}')
$text
[1] "f123"

Can you see what is happening? It truncates the string.

I won't blame jsonlite. What can it do?

OK, thank you very much. I now have a workaround: removing the NUL escape with gsub.
Jsonlite has a different parsing algorithm and can skip the wrong values.
If you could do something like that in the simdjson parser, it would be good and possibly faster than a gsub over the entire json string.

Jsonlite has a different parsing algorithm and can skip the wrong values.

That is not what it does. It truncates the strings. Please see my example. The string "f123\\u0000bfdssfsd" becomes "f123". Whether it is the correct behaviour really depends on your application domain.

If you could do something like that in the simdjson parser

The simdjson library (C++) itself just returns the string, with nulls and everything. We will not be truncating the strings.

The rcppsimdjson library could decide to truncate the strings when they contain null characters.

This being said, I submit to you that reporting an error by default is probably the desirable behaviour in many cases.

Silently truncating strings without the user knowing that it is happening could result in serious data loss. In an application where you are ingesting data, you could be discarding whole chunks of data unknowingly. This could result in a corrupted or degraded dataset.

However, it might be reasonable, as an option, to allow string truncation... in which case, it should be happening around here...

Rcpp::String(std::string(element.get<std::string_view>().first));
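To make the two behaviours concrete, here is a minimal sketch of what such an option could look like at that conversion point. It assumes C++17 for `std::string_view`. `NulPolicy` and `nul_safe` are hypothetical names, not part of RcppSimdJson, and the sketch returns a plain `std::string` rather than an `Rcpp::String` so that it stands alone.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical policy switch; not part of RcppSimdJson.
enum class NulPolicy { Error, Truncate };

// Turn a decoded JSON string into something safe to hand to a
// NUL-terminated consumer. With Error (the default) an embedded NUL
// throws; with Truncate everything from the first NUL onwards is dropped.
inline std::string nul_safe(std::string_view sv,
                            NulPolicy policy = NulPolicy::Error) {
    const auto pos = sv.find('\0');
    if (pos == std::string_view::npos) {
        return std::string(sv);             // no NUL: pass through unchanged
    }
    if (policy == NulPolicy::Error) {
        throw std::runtime_error("Embedded NUL in string.");
    }
    return std::string(sv.substr(0, pos));  // lossy: keep only the prefix
}
```

With `NulPolicy::Truncate`, the 13-character value decoded from "f123\u0000bfdssfsd" comes back as "f123", matching what jsonlite was observed to do above. Keeping `Error` as the default means the lossy path has to be requested explicitly.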

It really is a language / system mismatch. It is just like when Rcpp users want a dot in the name of a function argument on functions interfacing C++, where the dot has a different meaning and cannot be part of an identifier, so "can't do".

Here we have JSON allowing \u0000 inside strings. R does not, and cannot even assign such a string. So I am with Daniel here: erroring is actually better than silently altering your data. What you are doing with gsub() is under your control and is the appropriate step.

I should note that if you expect your inputs to almost never have a nul character, then you should not apply gsub eagerly. Instead, you should process the inputs as if they did not have nul characters. If you hit an error, then fall back on some error processing. If you do it this way and you almost never find nul characters, then you will have almost no performance penalty. On the plus side, you may also log which inputs required special handling.
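The "try first, clean up only on failure" pattern can be sketched as follows. Everything here is an assumption made for illustration: `strip_nul_escapes` and `parse_with_fallback` are hypothetical helpers, and `Parser` stands in for whatever parsing call is actually used, provided it throws `std::runtime_error` on failure. The cleanup consumes escapes pairwise, so an escaped backslash followed by the text u0000 is left alone, which a naive global substitution would not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Remove every \u0000 escape from JSON text. Escapes are consumed in
// pairs (backslash plus the following char), so the text \\u0000, an
// escaped backslash followed by "u0000", is kept intact.
inline std::string strip_nul_escapes(const std::string& json) {
    std::string out;
    out.reserve(json.size());
    std::size_t i = 0;
    while (i < json.size()) {
        if (json[i] != '\\') {
            out += json[i++];
        } else if (json.compare(i, 6, "\\u0000") == 0) {
            i += 6;                    // drop the NUL escape
        } else {
            out.append(json, i, 2);    // copy backslash plus escaped char
            i += 2;
        }
    }
    return out;
}

// Optimistic path first; the cleanup cost is paid only after a failure.
// used_fallback tells the caller which inputs needed special handling.
template <typename Parser>
auto parse_with_fallback(const std::string& json, Parser parse,
                         bool& used_fallback) -> decltype(parse(json)) {
    used_fallback = false;
    try {
        return parse(json);
    } catch (const std::runtime_error&) {
        used_fallback = true;
        return parse(strip_nul_escapes(json));  // rethrows if still invalid
    }
}
```

Note that this sketch retries after any `std::runtime_error`, not only the NUL one; if the failure had another cause, the second attempt simply throws again.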

If you frequently have nul characters, then you should have a chat with whoever is sending you the data.

I have been working for years with lots of JSON files and I have never encountered a nul character in a string yet. It is really not meant to be common.

The problem lies in R itself and the solution is to catch the error. :)
Thanks again