eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Consider upgrading to simdjson 0.4.0

lemire opened this issue

Highlights

  • Test coverage has been greatly improved and we have resolved many static-analysis warnings on different systems.

New features:

  • We added a fast (8GB/s) minifier that works directly on JSON strings.
  • We added a fast (10GB/s) UTF-8 validator that works directly on strings (any strings, including non-JSON).
  • The array and object elements have a constant-time size() method.

Performance:

  • Performance improvements to the API (type(), get<>()).
  • The parse_many function (ndjson) has been entirely reworked. It now uses a single secondary thread instead of several new threads.
  • We have introduced a faster UTF-8 validation algorithm (lookup3) for all kernels (ARM, x64 SSE, x64 AVX).

System support:

  • C++11 support for older compilers and systems.
  • FreeBSD support (and tests).
  • We support the clang front-end compiler (clangcl) under Visual Studio.
  • It is now possible to target ARM platforms under Visual Studio.
  • The simdjson library will never abort or print to standard output/error.

I should be able to get to that tomorrow. Really appreciate the heads-up and the attention to the concerns imposed upon us:

The simdjson library will never abort or print to standard output/error.

Does the C++11 support mean we can relax the C++17 requirement ?

Does the C++11 support mean we can relax the C++17 requirement ?

Absolutely. There are some constructs that you cannot use if you lack C++17, but I actually recommend not bothering with C++17 unless you happen to want to use C++17 for other reasons.

I just gave that one really quick check and it failed on me missing some includes once I hardwired C++11. I can follow-up with more detail but likely not now...

@eddelbuettel

We have rather extensive C++11 tests in CI and elsewhere... I have been regularly compiling with C++11...

$ c++ -std=c++11 -o amalgamate_demo amalgamate_demo.cpp -Wall -Wextra
lemire@f3bcc52c783e:/Users/lemire/CVS/github/simdjson/singleheader$ ./amalgamate_demo
Please specify at least one file name.

If you get some failure, we'd be very interested in knowing about it...

We (well, mostly @jkeiser) have worked really hard to bring C++11 support. This was no joke. Lots of seriously hard work.

I expect that I'll have to issue 0.4.1, I always do... but, fundamentally, C++11 should work.

If you want to play along:

  1. check out our rcppsimdjson
  2. overwrite / replace our inst/include/simdjson.{h,cpp} with your singleheader/simdjson.{h,cpp}
  3. remove or rename our (shell script) configure (to not test C++ std)
  4. copy src/Makevars.in to src/Makevars and edit line 3 to be CXX_STD = CXX11

Then I got

edd@rob:~/git/rcppsimdjson(feature/simdjson_0.4.0)$ R CMD INSTALL .
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘RcppSimdJson’ ...
** using staged installation
** libs
ccache g++  -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic  -c RcppExports.cpp -o RcppExports.o
ccache g++  -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'   -DSIMDJSON_NO_COMPUTED_GOTO -I../inst/include -fopenmp -fpic  -g -O3 -Wall -pipe -pedantic  -c deserialize.cpp -o deserialize.o
In file included from ../inst/include/RcppSimdJson/deserialize/simplify.hpp:5,
                 from ../inst/include/RcppSimdJson/deserialize.hpp:5,
                 from ../inst/include/RcppSimdJson.hpp:4,
                 from deserialize.cpp:1:
../inst/include/RcppSimdJson/deserialize/../common.hpp: In function ‘constexpr R_xlen_t rcppsimdjson::r_length(const _Container&)’:
../inst/include/RcppSimdJson/deserialize/../common.hpp:16:37: error: ‘size’ is not a member of ‘std’
   16 |   return static_cast<R_xlen_t>(std::size(__cont));
      |                                     ^~~~
../inst/include/RcppSimdJson/deserialize/../common.hpp: At global scope:
../inst/include/RcppSimdJson/deserialize/../common.hpp:23:8: warning: inline variables are only available with ‘-std=c++17’ or ‘-std=gnu++17’
   23 | static inline constexpr int64_t NA_INTEGER64 = LLONG_MIN;
      |        ^~~~~~

... plus pages more of course ...
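For anyone following along, the two diagnostics shown above (std::size and the inline variable) are both C++17 library/language features rather than simdjson problems. A minimal sketch of how they could be written for C++11 follows. It is not the package's actual code: the sketch namespace and the xlen_t typedef (a stand-in for R's R_xlen_t so it builds without R headers) are assumptions made for illustration.

```cpp
#include <climits>   // LLONG_MIN
#include <cstddef>   // std::size_t, std::ptrdiff_t
#include <cstdint>   // std::int64_t
#include <vector>

namespace sketch {

// Assumption: stand-in for R's R_xlen_t so the example builds without R.
typedef std::ptrdiff_t xlen_t;

// C++11 replacement for std::size(container): call the member directly.
template <typename Container>
xlen_t r_length(const Container& cont) {
  return static_cast<xlen_t>(cont.size());
}

// Overload for built-in arrays, which std::size also handles in C++17.
template <typename T, std::size_t N>
xlen_t r_length(const T (&)[N]) {
  return static_cast<xlen_t>(N);
}

// C++11 replacement for `static inline constexpr`: a namespace-scope
// constexpr variable already has internal linkage, so every translation
// unit gets its own copy and no `inline` keyword is needed.
constexpr std::int64_t NA_INTEGER64 = LLONG_MIN;

}  // namespace sketch
```

Calling sketch::r_length on a std::vector<int> or on a plain int array returns the element count in both cases, which is the behaviour the std::size call was relying on.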

I am not pointing fingers! I fully believe that the ducks are all aligned at your end. We're simply ... swimming in an adjacent pond here and I don't have time now to dig further.

The rcppsimdjson code has multiple instances of C++17-only stuff. Structured bindings can be replaced easily and std::optional wouldn’t be much harder, but I don’t think that if constexpr would be trivial to rip out.
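To make the trade-off concrete, here is a rough sketch of what two of those rewrites tend to look like in C++11: std::tie in place of a structured binding, and tag dispatch in place of if constexpr. The function names (make_entry, entry_value, describe) are made up for illustration and do not come from rcppsimdjson. The tag-dispatch version needs one helper overload per branch, which is why ripping out if constexpr gets tedious when there are many branches.

```cpp
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical helper returning a pair, as a stand-in for real code.
inline std::pair<std::string, int> make_entry() {
  return std::make_pair(std::string("answer"), 42);
}

// C++17: auto [key, value] = make_entry();
// C++11: declare the variables first, then unpack with std::tie.
inline int entry_value() {
  std::string key;
  int value = 0;
  std::tie(key, value) = make_entry();
  return value;
}

// C++17: if constexpr (std::is_integral<T>::value) { ... } else { ... }
// C++11: tag dispatch, one overload per branch.
template <typename T>
const char* describe_impl(T, std::true_type) { return "integral"; }

template <typename T>
const char* describe_impl(T, std::false_type) { return "other"; }

template <typename T>
const char* describe(T x) {
  // std::is_integral<T>::type is std::true_type or std::false_type.
  return describe_impl(x, typename std::is_integral<T>::type());
}

}  // namespace sketch
```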

And in a way it's fun to have structured bindings and all the newfangled stuff....

The thing that may matter more, as always, is getting RcppSimdJson to the masses of Windows users. Which in the past failed reliably but I also cannot remember why ...

Also, of course, entirely fair point that the C++17 dependency may at this point be self-imposed (based on older example code).

@eddelbuettel At the end of the day it’s your call, but I’m not a fan of moving backward here.

The nicer code aside, I don’t think it would actually expand the number of users.

A bigger priority should be getting this to pass on CRAN for Windows, which would expand the potential user base WAY more. (Okay, second biggest... after getting the exported functions worked out and on CRAN 😬).

I also suspect that Windows would still require Rtools4.0. GCC 4.9.3 isn’t really c++11, but rcppsimdjson works on GCC 7+ and clang 6+ right now.

All of that said... the code is sooo much nicer.

Same :)

But note that in Windows, since R 4.0.0, we now have a new Rtools with g++-8. I need to double check what blows up--likely some other posix-ness that may be hard(er) to fill in.

I’ll boot up the windows side of the computer and modify a one-off DESCRIPTION tomorrow to at least confirm we’re good building the 64bit version.

@lemire, assuming 32 bit is still on the “hidden menu” as you mentioned before #14 (comment) , I may be able to sort out what needs to be tweaked.

Worth a shot 🤞

Well, generally we have win-builder for that, plus a trusted three-line shell script that uploads to it. So I just did (by default to r-release and r-devel), and r-devel just came back with an aborted run, which is rare.

Link per email, valid "roughly 72 hours" as they say: https://win-builder.r-project.org/0GzKWVYr3IFG/ (but boring, seems to have stopped mid-run; could be them too)

Edit: Same for r-release. Hm. https://win-builder.r-project.org/wh26UqAAG40R/

@lemire, assuming 32 bit is still on the “hidden menu” as you mentioned before #14 (comment) , I may be able to sort out what needs to be tweaked.

I don't want to encourage Win32 usage which ought to be considered legacy at this point, but yeah... we compile and run all our tests on Win32, see the CI proof...

https://ci.appveyor.com/project/lemire/simdjson-jmmti/branch/master

Let me repeat my rant again:

  • It is probably no longer possible to buy a Windows PC or laptop with a 32-bit Windows.
  • Visual Studio defaults to 64-bit on 64-bit systems.

People who are tied to 32-bit binaries should update. We have had 64-bit processors forever. All our phones are 64-bit systems. If you still need to support 16-bit applications, then don't expect to also get access to the latest software.

In the Python universe, for Windows, 32-bit support has been falling left and right... and the net effect is a saner ecosystem where we no longer need to support 15-year-old compilers.

Let us rise up against the oppression of our 32-bit overlords who resist change.

I hear you. But CRAN makes the rules, and they move in mysterious ways. AFAIK they even tried removing 32-bit themselves but encountered pushback from users, who if memory serves were either running older binaries of something or older drivers.

If it were me I'd tell them to run older R along with it but I don't make "them rules". I just get to play by them for better or worse.

We could play games with them. For example, we could have a massive #ifndef DEADMEAT around it all and then define DEADMEAT in the 32-bit build, leading to "empty" code. Perfectly valid. People have done worse, i.e. one "well-known" large and recent project installs and builds and then tells you ... that it didn't really install yet and you now get to do the real install. Butt-ugly, but it gets them around a 'does not build here' rejection...

Actually, I already do the same / did the same for lack of C++17 compilers. So it's just some minor tuning...
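A minimal sketch of that guard idea, for illustration only. DEADMEAT is the placeholder name from the comment above, and everything inside the sketch namespace (parser_available, count_commas) is a hypothetical stand-in for the real simdjson-backed code. The fallback stubs in the #else branch are an addition beyond the strictly "empty" variant, included so that callers would still compile and link.

```cpp
#include <cstddef>

// Assumption: the 32-bit build would add -DDEADMEAT to its compiler flags.
#ifndef DEADMEAT

namespace sketch {

// Stand-in for the real functionality on supported platforms.
inline bool parser_available() { return true; }

inline std::size_t count_commas(const char* json) {
  std::size_t n = 0;
  for (const char* p = json; *p != '\0'; ++p) {
    if (*p == ',') ++n;
  }
  return n;
}

}  // namespace sketch

#else  // DEADMEAT defined: none of the real code is compiled

namespace sketch {

// Stubs so the package still builds; callers can check availability.
inline bool parser_available() { return false; }
inline std::size_t count_commas(const char*) { return 0; }

}  // namespace sketch

#endif
```

In a build without -DDEADMEAT, sketch::parser_available() reports true and the real branch is used. With the flag set, only the stubs exist, so the unsupported code never reaches the compiler.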

I don't want to encourage Win32 usage which ought to be considered legacy at this point, but yeah... we compile and run all our tests on Win32, see the CI proof...

For typically much better than worse, CRAN is arguably the defining feature of the R ecosystem. The overwhelming majority of potential simdjson users in R land will only ever do so through CRAN. I don't think anyone disagrees with what you're saying and I wouldn't characterize our intentions as "encouraging". We're just dealing with some very R-specific realities.

That said, it's entirely possible no user on a 32-bit system would ever touch this (and we should consider spamming warning()s suggesting alternatives).

I've run all the possible checks with a modified local version. The solution for us getting this on Windows is to check for the 32-bit version of the compiler that R Windows must use (MinGW) in some of the locations that simdjson currently checks for SIMDJSON_REGULAR_VISUAL_STUDIO.

Hard-wiring #define SIMDJSON_REGULAR_VISUAL_STUDIO 1 from our side isn't an option because it enables Visual Studio-specific behaviors, but at least it dies during compilation instead of pretending things are okay.

@lemire I'll open an issue upstream (and try to make the changes on the non-amalgamated files for a PR) for consideration, it's basically a matter of doing this...

#if defined(SIMDJSON_REGULAR_VISUAL_STUDIO) || (defined(__MINGW32__) && (!defined(__MINGW64__)))

... instead of just #ifdef SIMDJSON_REGULAR_VISUAL_STUDIO here and in a few other places.
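As a rough, self-contained illustration of why the condition is written that way: MinGW-w64 defines __MINGW32__ for both 32-bit and 64-bit targets and defines __MINGW64__ only for 64-bit targets, so the combination singles out the 32-bit toolchain. The SKETCH_NEEDS_FALLBACK macro and the functions below are hypothetical names for this example, not simdjson's. SIMDJSON_REGULAR_VISUAL_STUDIO is only set by simdjson itself, so in this standalone sketch it will normally be undefined.

```cpp
// Same condition as proposed above, folded into a hypothetical macro.
#if defined(SIMDJSON_REGULAR_VISUAL_STUDIO) || \
    (defined(__MINGW32__) && !defined(__MINGW64__))
#define SKETCH_NEEDS_FALLBACK 1
#else
#define SKETCH_NEEDS_FALLBACK 0
#endif

namespace sketch {

// True when the Visual Studio / 32-bit MinGW code path would be taken.
inline bool needs_fallback() { return SKETCH_NEEDS_FALLBACK != 0; }

// Human-readable label for the detected toolchain family.
inline const char* toolchain_name() {
#if defined(__MINGW64__)
  return "MinGW (64-bit)";
#elif defined(__MINGW32__)
  return "MinGW (32-bit)";
#elif defined(_MSC_VER)
  return "Visual Studio";
#else
  return "other";
#endif
}

}  // namespace sketch
```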

And for what it is worth, I contributed very similar extensions of #if defined(SomeVisualCThing), with additional ORs for what MinGW brings, to some other projects I have worked with. R is a little special here, but it has its upside as we know we will only ever deal with that one toolchain it provides. It is generally worth gritting one's teeth and swearing once or twice as it gets us in front of what is likely 90% of R users...

@knapply

We do not support MinGW currently, but I would be glad to do so. For us, the condition is to setup CI tests. I can't support something I cannot test systematically.

It looks like appveyor supports MinGW.

People in R land have long tested on Windows from CI as well. I am a bit of a Linux snob and never bothered (why test? then I would need to fix things on Windows, oh horror :) ) as we have win-builder.r-project.org to upload to, which I use during the development cycle. Slightly sloppier but easier to maintain. That said, Circle CI was an early choice, AppVeyor became dominant, now it is GH Actions, but I am fuzzy here -- people may still ship to AppVeyor from GH Actions. I can probably find you documentation and examples should it be needed.

Now, that is of course in the context of testing R packages, but there is no reason why it can't test C++ builds.

@lemire I assumed MinGW was an option because it's checked for here and here.

My understanding is that it's just so Windows can build using GCC compilers. https://cran.r-project.org/bin/windows/Rtools/

@eddelbuettel I haven't had too much trouble getting GH Actions to work (and it can do so much more than tests, although that's probably true for all of them). I think figuring out {sf}'s system dependencies (gdal, etc.) and {igraph} have been the only sources of pain, but even those seem fine now.

But that's for R packages, it'll take some digging to see what that's like for simdjson specifically.

As a stop-gap, #ifndef DEADMEAT is still on the menu (and would be pretty painless).