eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

0.1.0 release plan

eddelbuettel opened this issue · comments

Many, many thanks to @knapply @lemire @jkeiser and whoever else exercised magic to make this work on 32-bit Windows, especially in the flavor required by CRAN, notably gcc via mingw as shipped via Rtools.

This sets the stage for a 0.1.0 release. Given that CRAN prefers a weekly cadence, I will aim for an upload by the end of next week (i.e. around Jul 3 or 4), which gives us a few days to smooth and polish other aspects. @knapply @dcooley if you have anything 'pending' and possibly working by end of week, add it here; otherwise I'll just progress towards an upload.

Very cool to be able to hand the magic of simdjson to a million or two R Windows users. Thanks again for making it happen!

We may have to condition the #pragma away. Building at the (official) win-builder.r-project.org I see a few of these:

../inst/include/RcppSimdJson/deserialize/../../simdjson.h:115:48: note: #pragma message: The simdjson library is designed for 64-bit processors and it seems that you are not compiling for a known 64-bit platform. All fast kernels will be disabled and performance may be poor. Please use a 64-bit target such as x64 or 64-bit ARM.
 use a 64-bit target such as x64 or 64-bit ARM.")

and while I wholeheartedly agree with the message, the CRAN Repo Policy may have something against it. Not sure; maybe it can stay.

We may have to condition

Is that doable from our side?

Everything is always just an #ifdef away...
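
For concreteness, a minimal sketch of such a guard, assuming we are willing to patch the vendored copy of simdjson.h; RCPPSIMDJSON_SILENCE_32BIT_MESSAGE is an illustrative flag we would define ourselves (e.g. via PKG_CPPFLAGS in src/Makevars.win for the i386 architecture), not an existing simdjson macro:

// Hypothetical guard around the vendored #pragma in inst/include/simdjson.h
#ifndef RCPPSIMDJSON_SILENCE_32BIT_MESSAGE
#pragma message("The simdjson library is designed for 64-bit processors and it seems that " \
                "you are not compiling for a known 64-bit platform. All fast kernels will " \
                "be disabled and performance may be poor. Please "                          \
                "use a 64-bit target such as x64 or 64-bit ARM.")
#endif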

you have anything 'pending' and possibly working by end of week

I need to sort out the user-facing functions. None of the additions are accessible without RcppSimdJson:::.

Here's the prototype that I had put together #10 (comment)

fparse() was just a placeholder, but I'm now leaning toward that.

@dcooley, I was hoping you'd chime in. The original discussion was about from_json(), but I assumed you were referring to the functionality rather than the function's name. Did I understand that correctly?

Two other questions:

Do we want separate functions at the user-facing level for JSON that's inside an R character() versus JSON files (including files sitting at a URL)?

  • fparse() (or whatever we decide to call it) parses character()s.
    • analogous to simdjson::dom::parser::parse()
    • fparse_many() would be the multi-JSON (ndjson, jsonl, etc.) equivalent that we'll tackle later
      • analogous to simdjson::dom::parser::parse_many()
    • uses .deserialize_json() under the hood
  • fload() (or whatever we decide to call it) loads and parses files (including from URLs)
    • analogous to simdjson::dom::parser::load()
    • fload_many() would be the multi-JSON equivalent that we tackle later
      • analogous to simdjson::dom::parser::load_many()
    • uses .load_json() under the hood

I'm leaning toward separation as I'd like both functions to be "vectorized", and it's just simpler for the input to be either all files or all strings containing JSON (both to code robustly and for users to understand).
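
To make the split concrete, here is a rough C++-side sketch of keeping the two entry points separate and vectorized. The exported names (.is_valid_json, .is_valid_json_file) and the validate-only behaviour are illustrative placeholders rather than the package's actual functions, which would of course convert the parsed documents into R objects:

#include <Rcpp.h>
#include <simdjson.h>
#include <string>
#include <vector>

// hypothetical export name: strings in, parser.parse()
// [[Rcpp::export(.is_valid_json)]]
Rcpp::LogicalVector is_valid_json(const Rcpp::CharacterVector& json) {
    const auto strings = Rcpp::as<std::vector<std::string>>(json);
    simdjson::dom::parser parser;
    Rcpp::LogicalVector out(json.size());
    for (R_xlen_t i = 0; i < json.size(); ++i) {
        simdjson::dom::element doc;
        out[i] = parser.parse(strings[i]).get(doc) == simdjson::SUCCESS;
    }
    return out;
}

// hypothetical export name: file paths in, parser.load()
// [[Rcpp::export(.is_valid_json_file)]]
Rcpp::LogicalVector is_valid_json_file(const Rcpp::CharacterVector& file_paths) {
    const auto paths = Rcpp::as<std::vector<std::string>>(file_paths);
    simdjson::dom::parser parser;
    Rcpp::LogicalVector out(file_paths.size());
    for (R_xlen_t i = 0; i < file_paths.size(); ++i) {
        simdjson::dom::element doc;
        out[i] = parser.load(paths[i]).get(doc) == simdjson::SUCCESS;
    }
    return out;
}

Keeping strings and files apart like this means each function stays vectorized over a single kind of input, with no per-element guessing about what the user handed in.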

Can I drop GZstream directly into the package to handle .gz files?

I originally saw this in {ndjson} and have since used it everywhere that I might touch compressed files. It prevents the need to inflate gz files in a separate step by essentially swapping out std::ifstream for igzstream.

It makes handling .gz files both easier and faster:

  • no separate "unzip" step
  • no temporary file cleanup
  • no materializing strings as R objects before parsing.

simdjson::parser::load() will take care of us for non-compressed files as is.

Regarding separate fparse and fload: I like that plan.

What does GZstream add that plain gzopen() (from zlib.h) does not already offer? Do you need to seek backwards and forwards? I already used gzopen() back in the 1990s to just open gzip-compressed files as if they were normal files. Still works the same way...


Oh I see. Stream use. Hm. We could include it. I tend to prefer to keep dependencies minimal. Can what you do be done with a FILE* ?


Also, it's tiny. We could arguably just take it and create RcppGzStream out of it, making it header-only in the process. But it doesn't even seek. Maybe gzopen()?

What does GZstream add that plain gzopen() (from zlib.h) not already offers?

It's easy? Frankly, it's because I know how to use it and I know it works without much fuss... but that may just be because I don't know any better.

The main reason I've found it nice is that it plays along with std::getline().

This is a bit contrived, but it reflects something I deal with all the time... (the package isn't important, it's just that I already had stuff crammed in this folder as a playground)

// [[Rcpp::plugins(cpp17)]]
// [[Rcpp::depends(tweetio)]]

#include <Rcpp.h>
#include <filesystem>    // std::filesystem::path used in is_gz()
#include <fstream>       // std::ifstream
#include <string>
#include <string_view>
#include <vector>
#include <tweetio.hpp>
#include <tweetio/third-party/simdjson/simdjson.cpp>
#include <tweetio/third-party/gzstream/gzstream.cpp>

template <typename stream_T>                         // just pick the stream
auto keep_valid_json_template(const std::string_view file_path) -> std::vector<std::string> {
  stream_T in_file;

  in_file.open(std::string(file_path).c_str()); // only difference? gzstream wants `const char *`
  auto out = std::vector<std::string>();
  simdjson::dom::parser parser;
  std::string line;

  while (std::getline(in_file, line)) {
    if (auto [_, error] = parser.parse(std::string_view(line)); !error) {
      out.emplace_back(std::move(line));
    }
  }

  in_file.close();
  return out;
}

auto is_gz(const std::string& file_path) -> bool {
  return std::filesystem::path(file_path).extension() == ".gz";
}

// [[Rcpp::export(.keep_valid_json)]]
SEXP keep_valid_json(const std::string& file_path) {
  const auto out = is_gz(file_path) ? keep_valid_json_template<igzstream>(file_path) // easy abstraction over both gz/non-gz files
                                    : keep_valid_json_template<std::ifstream>(file_path);

  return Rcpp::CharacterVector(std::begin(out), std::end(out));
}

bigish_jsonl <- Sys.glob("~/ufc-tweet-stream.json")
bigish_jsonl.gz <- Sys.glob("~/ufc-tweet-stream.json.gz")

scales::number_bytes(file.size(bigish_jsonl))
#> [1] "1 GiB"
scales::number_bytes(file.size(bigish_jsonl.gz))
#> [1] "122 MiB"

res1 <- .keep_valid_json(bigish_jsonl)
res2 <- .keep_valid_json(bigish_jsonl.gz)
identical(res1, res2)
#> [1] TRUE
scales::comma(length(res1))
#> [1] "140,946"
pryr::object_size(res1)
#> 901 MB

everything_kept <- readLines(bigish_jsonl, warn = FALSE)
pryr::object_size(everything_kept)
#> 1.08 GB
scales::comma(length(everything_kept))
#> [1] "559,067"

The point is that it's just easy to handle regular and gzipped files with minimal code duplication.

I'm thinking specifically about coping with .ndjson/.jsonl files that contain junk, which happens all the time with truly large files. simdjson::dom::parser::load_many() will (rightfully) throw a fit.

By using it now for "vanilla" JSON, we're already set up to use it for NDJSON/JSONL/etc.

Buuuut..... maybe there's an easy way to do this with zlib.h? I have no idea.

I'm game to set up an RcppGzStream package, but it's only two files so dropping it in here directly is simple (and yes, no seek).

It is easy. See e.g. here for how I handled it in RcppCNPy, adding gz support to what was already there -- a little easier as I can pass the file pointer around, but I'll try to cook up an example for you. Because your approach is line-oriented, we should be able to just do it with the C library semantics.

This is embarrassingly basic but it worked. Took github_events.json, copied it to /tmp and called gzip on it. The following file, run via Rcpp::sourceCpp("demo.cpp"), then slurps it and shows it line by line. No compiler or linker switches needed as we get zlib from R. It needs the header to compile though, package zlib1g-dev on Ubuntu. You likely have it already.

One shortcoming is the stack variable, and the fact that it would croak on longer lines.

#include <Rcpp.h>
#include <zlib.h>

// [[Rcpp::export]]
void simpleDemo(std::string & fname) {
  gzFile fp = gzopen(fname.c_str(), "rb");
  if (!fp) {
    Rcpp::stop("npy_gzload: Error! Unable to open file %s!\n",fname.c_str());
  }

  char line[1024];
  while (gzgets(fp, line, sizeof(line)) != NULL ) { // same as fgets()
    std::string s(line);
    Rcpp::Rcout << "Seeing: " << s;
  }
  gzclose(fp);
}

/*** R
simpleDemo("/tmp/github_events.json.gz")  # smaller example file, gzip'ed
*/
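
To address the fixed-size buffer caveat mentioned above, here is a possible variant (same zlib calls, purely an illustrative sketch) that keeps appending gzgets() chunks into a std::string until a newline or EOF shows up, so longer lines survive intact:

#include <Rcpp.h>
#include <zlib.h>

// Illustrative variant of the demo above: grows a std::string so that lines
// longer than the chunk buffer are handled instead of being split.
// [[Rcpp::export]]
void simpleDemoLongLines(std::string & fname) {
  gzFile fp = gzopen(fname.c_str(), "rb");
  if (!fp) {
    Rcpp::stop("Unable to open file %s", fname.c_str());
  }

  char chunk[1024];
  std::string line;
  while (gzgets(fp, chunk, sizeof(chunk)) != NULL) {   // same as fgets()
    line.append(chunk);
    if (!line.empty() && line.back() == '\n') {        // got a complete line
      Rcpp::Rcout << "Seeing: " << line;
      line.clear();
    }
  }
  if (!line.empty()) {                                 // trailing line without newline
    Rcpp::Rcout << "Seeing: " << line << "\n";
  }
  gzclose(fp);
}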

@knapply

but I assumed you were referring to the functionality rather than the function's name. Did I understand that correctly?

correct.


@dcooley if you have anything 'pending' and possibly working by end of week, add it here; otherwise I'll just progress towards an upload.

Nothing more from me.

Great. End of week looks pretty viable then.

I think we're ready. I put up this ad-hoc script and tossed 'em all at R-hub:

#!/usr/bin/env Rscript

library(rhub)

selected <- c("debian-clang-devel",
              "debian-gcc-devel",
              "fedora-clang-devel",
              "fedora-gcc-devel",
              #"ubuntu-gcc-devel",  ## uses 16.04 and compiler groans as not C++17 ready
              "macos-highsierra-release",
              "macos-highsierra-release-cran",
              "windows-x86_64-devel",
              "windows-x86_64-release")

check(".", platform = selected, email = getOption("email", "edd@debian.org"))

And apart from the now commented-out one, they all passed with flying colours. Time to ship...

Awesome.

Do you have the error logs that ubuntu-gcc-devel/Ubuntu 16.04 spat out? It looks like gcc-5 is what comes standard, and simdjson itself needs gcc>=7, so I'm curious if it's wholly out of our hands.

Sure, as I get the emails back. For as long as the time window holds:

RcppSimdJson 0.1.0: PREPERROR

Build ID: RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125
Platform: Ubuntu Linux 16.04 LTS, R-devel, GCC
Submitted: 3 minutes 47.6 seconds ago
Build time: 3 minutes 44.7 seconds

See the full build log:
HTML: https://builder.r-hub.io/status/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125
Text: https://builder.r-hub.io/status/original/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125
Artifacts: https://artifacts.r-hub.io/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125

Have questions, suggestions or want to report a bug?
Please file an issue ticket at GitHub at
https://github.com/r-hub/rhub/issues

Thank You for using the R-hub builder.

Also, hold back your awesome til you see the timings @knapply has. Now those are awesome 😎

But I agree with you on the g++ issue for Ubuntu 16.04. That one is probably on me, with a task of tweaking configure so that it does not lie to me about C++17 when it really truly genuinely seems to fail at it.

Regarding C++11, I hope it's not simdjson itself that has driven you to C++17 :) If you're using it internally a lot, that's cool.

In 0.4 we reworked simdjson's error code API so C++11 feels at least as good as the C++17 structured bindings approach. I actually like it better.

The C++17 error code API exposes some limitations of structured bindings: you have to declare 2 new variables. Also, the type has to go on the right (in get<double>()) rather than on the left, making it hard to tell what your code does at first glance. C++11 also lets you avoid distracting (to some people) template syntax:

// Error API with C++11 get()
double x, y;
auto error = doc["x"].get(x);
if (error) { cerr << error << endl; exit(1); }
error = doc["y"].get(y);
if (error) { cerr << error << endl; exit(1); }


// This is my preferred syntax for C++11, though it's not everyone's cup of tea:
simdjson::error_code error;
double x, y;
if ((error = doc["x"].get(x))) { cerr << error << endl; exit(1); }
if ((error = doc["y"].get(y))) { cerr << error << endl; exit(1); }

// Error API with C++17 structured bindings
auto [x, error] = doc["x"].get<double>();
if (error) { cerr << error << endl; exit(1); }
auto [y, error2] = doc["y"].get<double>();
if (error2) { cerr << error2 << endl; exit(1); }

// Obviously, it looks best with exceptions, which work everywhere :)
double x = doc["x"];
double y = doc["y"];

The only place that I feel still works better in C++17 is iterating over an object, but honestly that doesn't look all that bad in C++11 either:

// Object iteration in C++11
for (auto field : obj) {
  cout << field.key << ": " << field.value << endl;
}
// Object iteration with C++17 structured bindings
for (auto [key, value] : obj) {
  cout << key << ": " << value << endl;
}

I think the object iteration is such a niche thing that it's worth using the C++11 syntax to buy compatibility (especially since it's still pretty good).

Food for thought. RcppSimdJson started out with C++17 only because ... well you guys did use it. Now that the reach is so much wider I might do different things.

Turns out there aren't many auto [x, error]'s. Just for a flavor, this code would look like:

            for (auto&& object : array) {
                simdjson::dom::element element;
                if (object[key].get(element) == simdjson::SUCCESS) {
                    out[i_row] = get_scalar<scalar_T, R_Type, NO_NULLS>(element);
                }
                i_row++;
            }

For that last one, actually, if you make get_scalar() capable of taking a simdjson_result, you could make it downright nice looking (assuming you're OK with assigning a default to the element if it doesn't convert):

            for (auto&& object : array) {
                out[i_row] = get_scalar<scalar_T, R_Type, NO_NULLS>(object[key]);
                i_row++;
            }
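
A hedged sketch of what such an overload could look like; the template parameter kinds, the scalar_T return type, and the scalar_T{} fallback are assumptions about the existing get_scalar() rather than its real signature:

// Hypothetical overload sketch only; it assumes an element-taking get_scalar()
// like the one in the snippet above already exists and delegates to it.
template <typename scalar_T, auto R_Type, bool NO_NULLS>
scalar_T get_scalar(simdjson::simdjson_result<simdjson::dom::element> res) {
    simdjson::dom::element element;
    if (res.get(element) == simdjson::SUCCESS) {
        return get_scalar<scalar_T, R_Type, NO_NULLS>(element);   // delegate to the element overload
    }
    return scalar_T{};   // assumed default when the key is missing or conversion fails
}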

And this code, with an object iteration, can be directly translated thus:

        simdjson::dom::object object;
        if (element.get(object) == simdjson::SUCCESS) {
            for (auto&& field : object) {
                if (cols.schema.find(field.key) == std::end(cols.schema)) {
                    cols.schema[field.key] = Column<type_policy>{col_index++, Type_Doctor<type_policy>()};
                }
                cols.schema[field.key].schema.add_element(field.value);
            }

        } else {
            return std::nullopt;
        }
    }

(at_key will work too, I just tend to use [] for objects :)

The C++17 pieces are less about simdjson and more about dealing with the necessary type dynamism.

It currently leans on if constexpr quite a bit to a) limit what seemed like gratuitous code duplication and b) maximize the extreme flexibility it currently offers without compromising any performance. But, that doesn't mean it can't be done.
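
As a flavor of that pattern (a generic illustration, not the package's actual converter), if constexpr lets one templated loop branch at compile time on the target R vector type rather than duplicating near-identical loops:

#include <Rcpp.h>
#include <simdjson.h>
#include <type_traits>

// Generic illustration of the `if constexpr` dispatch described above; the
// function and its behaviour are a sketch, not RcppSimdJson's actual code.
template <typename vec_T>
vec_T to_r_vector(simdjson::dom::array array) {
    R_xlen_t n = 0;
    for (auto&& ignored : array) { (void)ignored; ++n; }   // count elements first
    vec_T out(n);

    R_xlen_t i = 0;
    for (simdjson::dom::element element : array) {
        if constexpr (std::is_same_v<vec_T, Rcpp::NumericVector>) {
            double val;
            out[i] = element.get(val) == simdjson::SUCCESS ? val : NA_REAL;
        } else if constexpr (std::is_same_v<vec_T, Rcpp::LogicalVector>) {
            bool val;
            out[i] = element.get(val) == simdjson::SUCCESS ? static_cast<int>(val) : NA_LOGICAL;
        } else {
            static_assert(std::is_same_v<vec_T, Rcpp::NumericVector> ||
                          std::is_same_v<vec_T, Rcpp::LogicalVector>,
                          "unhandled target vector type");
        }
        ++i;
    }
    return out;
}

Adding another target type (say Rcpp::IntegerVector) then means adding one more branch rather than another copy of the whole loop.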

I haven't measured it formally, but I wouldn't be surprised if simdjson's parsing takes < 10% of the total time. The bottleneck is cramming the results into a limited menu of R objects.

Agreed. And not a worry for today --- today is 0.1.0 day to get it in front of hungry Windows users!

Anything else to take care of? Otherwise I'll wrap up and ship.

Good on my end!

For what it's worth, RcppSimdJson as is (using C++17) builds and passes on GCC 7+ and Clang 6+, which are what simdjson itself supports (at least last I looked). I think that means we hit every compiler that both R and simdjson can currently use (no idea about Intel and I'm guessing there isn't some secret effort to get R to play along with Visual Studio).

But now I am curious what it would take... πŸ€”

On CRAN now: https://cran.r-project.org/package=RcppSimdJson

Will take a day or two or three for Windows binaries to be built.

It doesn't have all the bells and whistles of the other JSON packages, but it's definitely a beast.

install.packages("RcppSimdJson") # 🔥🚀

file <- tempfile(fileext = ".json")
download.file("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json", destfile = file)
string <- readChar(file, nchars = file.size(file))

bench::mark(
    RcppSimdJson::fparse(string),
    jsonify::from_json(string),
    jsonlite::fromJSON(string)
    ,
    check = FALSE,
    filter_gc = FALSE
)
#> # A tibble: 3 x 6
#>   expression                        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 RcppSimdJson::fparse(string) 868.87ms 868.87ms    1.15      58.6MB    0    
#> 2 jsonify::from_json(string)      7.66s    7.66s    0.131    104.5MB    0.653
#> 3 jsonlite::fromJSON(string)     36.15s   36.15s    0.0277     369MB    1.36

bench::mark(
    RcppSimdJson::fload(file),
    jsonify::from_json(file),
    jsonlite::fromJSON(file)
    ,
    check = FALSE,
    filter_gc = FALSE
)
#> # A tibble: 3 x 6
#>   expression                     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 RcppSimdJson::fload(file) 816.47ms 816.47ms    1.22      57.7MB    0    
#> 2 jsonify::from_json(file)     7.16s    7.16s    0.140    104.5MB    0.838
#> 3 jsonlite::fromJSON(file)    38.48s   38.48s    0.0260   550.1MB    1.53

Nice work @knapply !