eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

R crashes when parsing a heavily nested json.

GentleGhostCoder opened this issue · comments

Hi there,
I cannot parse a certain JSON document because R crashes while parsing it.
The Json is a query-response from Prometheus and maps the CPU performance of some servers.

So far I have tested the fparse function with various max_simplify_lvl settings.
The fminify and is_valid_json functions work fine, but they do not help with the fparse problem either.
If I reduce the result array, parsing works again.
Parsing the same data with jsonlite works fine.

Unfortunately, I cannot attach a sample file as it is company data.

The prometheus request looks like this:
https://:/api/v1/query?query=node_cpu_seconds_total%7B%7D"

The response content size is 50552672 bytes.

The schema looks like this:
{
  "type": "object",
  "required": [],
  "properties": {
    "status": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "data": {
      "type": "object",
      "required": [],
      "properties": {
        "resultType": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "result": {
          "type": "array", ##### Up to 205808 objects
          "items": {
            "type": "object",
            "required": [],
            "properties": {
              "metric": {
                "type": "object",
                "required": [],
                "properties": {
                  "__name__": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "alias": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "cluster": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "cpu": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "datacenter": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "instance": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "job": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "mode": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  }
                }
              },
              "value": {
                "type": "array",
                "items": {
                  "type": "array",
                  "items": {
                    "type": "number"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

The underlying library (simdjson) should not crash.

Could you provide some thoughts on how the rcppsimdjson team might be able to identify and fix the issue given the information you provided?

It would be helpful to have a minimally reproducible example. R too should not crash, nor should our glue around simdjson introduce one.

Since I cannot provide a sample file, it is difficult for you to analyze the problem in more depth. I was rather hoping that you could tell me what else I could possibly test or how I could narrow down the problem.

What I can do is create a reproducible sample file that includes generated data for you.

@semmjon The fact that is_valid_json does not crash suggests that the parsing (in C++) works fine. Because rcppsimdjson checks validity by producing a full DOM, it knows that the document can be parsed and materialized as a tree. It actually does so, in full. This narrows the problem down somewhat.

Here is an example where it is already crashing. I could try to analyze more precisely at what array size it crashes.

metric <- '{
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"datacenteraggregation",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"iowait"
            },
            "value":[
               12345656.643,
               "12345656.643"
            ]
         }'

test <- paste0('{
   "status":"success",
   "data":{
      "resultType":"vector",
      "result":[
         {
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"someAlias",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"idle"
            },
            "value":[
               12345656.643,
               "12345656.643"
            ]
         },
         ',paste(lapply(1:200000,function(x) paste(metric)),collapse=","),'
      ]
   }
}')

test <- RcppSimdJson::fparse(test)
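
To vary the array size more easily, the same kind of payload can be generated by a small function. This is only a sketch: `make_payload` is a hypothetical helper that is not part of RcppSimdJson, and the entry is a shortened version of the one above.

```r
# Hypothetical helper: build a Prometheus-style response with `n` result entries.
make_payload <- function(n) {
  entry <- paste0(
    '{"metric":{"__name__":"node_cpu_seconds_total","cpu":"0","mode":"iowait"},',
    '"value":[12345656.643,"12345656.643"]}'
  )
  paste0(
    '{"status":"success","data":{"resultType":"vector","result":[',
    paste(rep(entry, n), collapse = ","),
    ']}}'
  )
}

json <- make_payload(3)
nchar(json)

# With RcppSimdJson installed, the size can then be varied, e.g.:
# parsed <- RcppSimdJson::fparse(make_payload(65290))
```

Using `rep()` instead of `lapply()` avoids building a 200000-element list just to paste it back together.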

@semmjon Is the 200000 parameter minimal? That is, it only crashes when it is 200000, but does not when you have 100000. Is that it? And you confirm that is_valid_json on this input works, right?

I just tested it again and it started to crash at >65290.
And yes is_valid_json works.
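
One way to pin down such a threshold without losing the R session on every crash is to bisect over the array size and run each attempt in a fresh Rscript process. The sketch below assumes the failure is monotone (sizes below some threshold work, sizes at or above it fail). Both function names are hypothetical, and the subprocess predicate uses a simplified payload rather than the original data.

```r
# Smallest n in [lo, hi] for which ok(n) is FALSE, assuming ok() is TRUE
# below some threshold and FALSE at or above it. NA if ok(hi) is TRUE.
first_failure <- function(ok, lo, hi) {
  if (ok(hi)) return(NA_integer_)
  if (!ok(lo)) return(as.integer(lo))
  # Invariant from here on: ok(lo) is TRUE and ok(hi) is FALSE.
  while (hi - lo > 1) {
    mid <- lo + (hi - lo) %/% 2
    if (ok(mid)) lo <- mid else hi <- mid
  }
  as.integer(hi)
}

# Hypothetical predicate: parse n entries in a separate Rscript process,
# so that a crash only kills the child. Requires RcppSimdJson in the child.
parses_in_subprocess <- function(n) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(c(
    sprintf("n <- %d", as.integer(n)),
    "entry <- '{\"metric\":{\"cpu\":\"0\"},\"value\":[1.5,\"1.5\"]}'",
    "json <- paste0('{\"result\":[', paste(rep(entry, n), collapse = ','), ']}')",
    "invisible(RcppSimdJson::fparse(json))"
  ), script)
  status <- system2(file.path(R.home("bin"), "Rscript"),
                    c("--vanilla", script), stdout = FALSE, stderr = FALSE)
  isTRUE(status == 0)
}

# Usage (not run here):
# first_failure(parses_in_subprocess, 1, 200000)
```

Because the child is started with `--vanilla`, this also separates the package itself from anything the interactive session has loaded.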

BTW you can use ``` r (without the space) to open an R code segment and ``` (ditto) to close it. I have my hands full right now but I will take a look later. Also paging @knapply for good measure.

Hmm, weird: it only seems to crash on the RStudio Server, not in my local RStudio. Possibly a problem with the R version?

The server has R version 3.6.3.
Locally I have 4.0.3.

Is this on purpose:

"value":[
               12345656.643,
               "12345656.643"
            ]

?

Yes, I took it from the original data.

Does not reproduce:

edd@rob:~/git/rcppsimdjson(master)$ head issue63.R
metric <- '{
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"datacenteraggregation",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"iowait"
edd@rob:~/git/rcppsimdjson(master)$ 
edd@rob:~/git/rcppsimdjson(master)$ tail issue63.R 
               "12345656.643"
            ]
         },
         ',paste(lapply(1:200000,function(x) paste(metric)),collapse=","),'
      ]
   }
}')

test <- RcppSimdJson::fparse(test)
cat("Still here.\n")
edd@rob:~/git/rcppsimdjson(master)$ 
edd@rob:~/git/rcppsimdjson(master)$ Rscript issue63.R
Still here.
edd@rob:~/git/rcppsimdjson(master)$ 

"works on my machine"

Can you provide a sessionInfo()?

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.3 htmltools_0.5.1 tools_3.6.3 yaml_2.2.1 rmarkdown_2.6 knitr_1.30 xfun_0.20 digest_0.6.27 packrat_0.5.0 rlang_0.4.10 evaluate_0.14 RcppSimdJson_0.1.3

Can you please do what I did: copy your code to a file, add a cat("Done\n") or the like at the end, and run it in a terminal via Rscript?

If that passes it means it's not us but some weird interaction or resource starvation happening with your RStudio session. We can try to narrow it down to an RStudio session without xfun, knitr, ... and all those other packages which RcppSimdJson does not need. There should not be an interaction here but one never knows...

In short, we need something reproducible. Which we currently do not have. (Though I appreciate your code snippet. It's a valid first step, but in this case one that allowed us to disprove the claim too.)

sgeist@rstudio:~$ head crash_rcppsimdjson.R

metric <- '{
"metric":{
"__name__":"node_cpu_seconds_total",
"alias":"datacenteraggregation",
"cluster":"someCluster",
"cpu":"0",
"datacenter":"some-Datacenter",
"instance":"instance.endpoint:1234",
"job":"clusters",
sgeist@rstudio:~$ Rscript crash_rcppsimdjson.R
still here.

Ok it works from console.

After I reinstalled RcppSimdJson the bug was solved 🙄
Very mysterious and incomprehensible ... probably not a common case.

Was this the crash?

sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 18.04.4 LTS
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
# 
# locale:
#   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
# [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.3 tools_3.6.3

RStudio.Version()[c("mode", "version")]
# $mode
# [1] "server"
# 
# $version
# [1] ‘1.2.5001’

image

Yes
At this point, thank you for your help / support and keep it up (I think the package is great 👍)

Ok, no worries. I'll close this then -- feel free to reopen if it rears its head again.

Did you guys figure out what caused the crash?

No. And as I wrote recently on the rcpp-devel list apropos one of the micro-releases to the github-hosted repo, there is some apparent instability in the toolchain. My reverse-dependency universe is now ~ 2200 packages. I worked on a branch refactoring some internals last year and stopped when an initial run showed ~ 10% (give or take) breaking. The refactor is important though (an internal how-to-grow-large objects thing) and @Enchufa2 recently picked up the branch and finished it. We once again had ~ 10% breakage ... but I had just released 1.0.6 and seen that a certain nexus of packages around rstan would fail tests "in an odd way" at run-time. (Failing to compile is more blunt and an obvious API change.) So we did that and lo-and-behold the breakage went away. (There was still some related to one or two other CRAN packages.)

So all this sermon just to say that "something" appears to be binary-toolchain-fragile but I do not know what it is. Recompilation helps, and that was the change here too.
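
One cheap thing to look at before rebuilding is which R version an installed package was built under, which is recorded in the `Built` field of its DESCRIPTION. The sketch below uses hypothetical helper names and is only a heuristic: a mismatch does not prove anything is broken, and a match does not rule out a stale binary (for example one compiled against an older version of a dependency).

```r
# R version recorded in the package's "Built" field, e.g. "3.6.3",
# or NA if the package or the field is not found.
built_under <- function(pkg) {
  built <- utils::packageDescription(pkg, fields = "Built")
  if (is.na(built)) return(NA_character_)
  # The field looks like "R 3.6.3; x86_64-pc-linux-gnu; <date>; unix".
  sub("^R ([0-9.]+);.*$", "\\1", built)
}

# TRUE if the package was built under a different R than the one running.
needs_rebuild <- function(pkg) {
  built <- built_under(pkg)
  if (is.na(built)) return(NA)
  built != as.character(getRversion())
}

built_under("stats")
needs_rebuild("stats")

# If in doubt, reinstalling from source forces a recompile:
# install.packages("RcppSimdJson", type = "source")
```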

RcppSimdJson is a good testbed as simdjson is clean -- we don't schlepp any other depends in. So in short: not sure what it is, but rebuilding makes it go away.