Venemo / node-lmdb

Node.js binding for lmdb

Asynchronous batch write data types

kylebernhardy opened this issue

Hi,

Looking at the documentation for async batch writes, it looks like the values are forced to be buffers. Is there any way to use string data types for the values? Out of curiosity, can non-string (Buffer/Number) data types be used for keys?

Thanks for creating & maintaining an awesome library!

Is there any way to use string data types for the values?

Yes, this would be possible to add support for. However, there are a couple of things to consider:
First, node-lmdb's getString/putString/getCurrentString/etc. methods use UTF-16 for encoding/decoding strings. Unless you are using a lot of non-latin unicode characters, I would recommend against storing strings in UTF-16, as it is much less space-efficient than UTF-8 and is usually less performant as well. Node.js/V8 tries very hard to internally store strings in "one-byte" Latin format whenever possible, and will only store strings in "two-byte" UTF-16 as needed, so encoding strings as UTF-16 often involves re-encoding into a longer representation. Furthermore, decoding UTF-16 will often "force" V8 to use a "two-byte" representation instead of the optimal string format, which can then negatively impact all downstream JS code that interacts with those strings. UTF-16 encoding/decoding often performs fine in micro-benchmarking, but there are a lot of negative side effects from using it. Therefore, if you care about performance/efficiency (I assume that is important to HarperDB devs), and unless your data really involves a lot of non-latin characters, UTF-8 is generally a better general-purpose encoding and works better with V8's string representations.
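
As a quick illustration of the size difference, using plain Node.js Buffer APIs (independent of node-lmdb):

```js
// A mostly-latin value encoded both ways
const text = 'hello world, a typical latin-heavy value';

const utf8 = Buffer.from(text, 'utf8');     // 1 byte per latin character
const utf16 = Buffer.from(text, 'utf16le'); // 2 bytes per character

console.log(utf8.length, utf16.length); // the UTF-16 encoding is roughly twice as large here
```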

And I actually already have implemented support for handling and storing string values in batchWrite as UTF-8 in lmdb-store (which is part fork of node-lmdb and part layer on top of it), so I could certainly backport that to node-lmdb. However, if you are going to be storing UTF-8 strings, I assume I should probably also port over the accompanying getUtf8, putUtf8, etc. methods (I have thought about doing that, but wasn't sure the API expansion was warranted).

Another consideration is that string encoding is actually a fairly expensive operation regardless of whether it is UTF-8 or UTF-16, and passing a large array of operations with string values to batchWrite can result in a relatively long-running function call. It is important to note that the interaction with the strings (encoding them) must be done on the main thread; it cannot be offloaded to the worker thread that performs the transaction.

Of course the encoding has to happen at some point. But my intention for how batchWrite would be used was that many operations could be batched together from different actions in an application or db, and then (using a timer or whatever) submitted in a single batchWrite call. If each of these individual operations takes on the overhead of encoding strings to buffers, the cost of the encoding is spread out over many separate operations (much better/smoother interleaving) rather than concentrated in a single blocking batchWrite call.
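
As a rough sketch of that usage pattern (the exact batchWrite signature/operation format may differ slightly from this, and the env/dbi setup and 10ms batch window are just placeholders):

```js
const lmdb = require('node-lmdb');

const env = new lmdb.Env();
env.open({ path: './mydb', maxDbs: 10 });
const dbi = env.openDbi({ name: 'values', create: true });

// Encode string values to UTF-8 buffers as operations arrive, then flush
// them to the worker thread in a single batchWrite call on a timer.
const pending = [];
let flushScheduled = false;

function queuePut(key, stringValue) {
  // the encoding cost is paid here, spread across many small calls
  pending.push([dbi, key, Buffer.from(stringValue, 'utf8')]);
  if (!flushScheduled) {
    flushScheduled = true;
    setTimeout(flush, 10); // batch window, tune for your workload
  }
}

function flush() {
  const ops = pending.splice(0, pending.length);
  flushScheduled = false;
  env.batchWrite(ops, (error) => {
    if (error) console.error('batch write failed', error);
  });
}
```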

Furthermore, if you are potentially storing a lot of smaller string values, there are additional optimizations that can be done for conversion to buffers (using pure JS when possible) that are well beyond the scope of node-lmdb.

Out of curiosity, can non-string (Buffer/Number) data types be used for keys?

Yes, you should be able to use buffers and numbers as keys for data submitted to batchWrite. One caveat is that it "infers" the key type from the first entry, so all the keys need to be the same type (it is difficult to mix key types with node-lmdb anyway).

Anyway, let me know if you do want me to add support for string values in batchWrite (and what form).

Thank you so much for the detailed response. We are overall performing a lot of small writes where the value is a string, in most cases a relatively small string. In most cases the value would be UTF-8; the exception is the value of our primary index, where we have no guarantee all data elements will only be UTF-8. Do you know what the performance at scale is like when the value is a buffer and we need to marshal it back to a string in node-lmdb?

So are you asking about the performance difference between doing a get from the database and decoding with UTF-16 (as node-lmdb's getString does) vs decoding with UTF-8? I'd say the difference between these two decoding methods is pretty small (and it can go either way depending on the presence of non-latin characters), but the differences in memory usage are more obvious (and eventually have more pernicious effects on overall performance).

Depending on usage, generally the fastest possible get method is to have get copy its data directly into a pre-existing, reused buffer (creating new buffers is expensive), and for smaller strings it might actually be slightly faster to use pure JS decoding on such a buffer (there is an overhead to constructing strings in C as well). But these differences are probably small if you are just working with strings; the standard getString is pretty fast. The performance advantages of working with buffers in JS are more significant when deserializing to objects.
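
As a rough sketch of the difference in plain node-lmdb terms (getBinaryUnsafe avoids allocating a new buffer, but its view is only valid inside the transaction; the copy-into-a-reused-buffer approach described above is what lmdb-store adds natively):

```js
// env/dbi opened as in the earlier sketch
const txn = env.beginTxn({ readOnly: true });

// Standard approach: the value is decoded from UTF-16 to a JS string in C++
const viaString = txn.getString(dbi, 'some-key');

// Buffer-based approach: read the raw bytes, then decode in JS.
// Decode before ending the transaction, since the view is only valid until then.
const raw = txn.getBinaryUnsafe(dbi, 'some-key');
const viaBuffer = raw && raw.toString('utf8');

txn.abort();
```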

I'm not sure I understand how you are writing (or intending to write) the string values; do you not know how the strings will be encoded ahead of time (whether you will encode them with UTF-8, UTF-16, or something else)? Are these the type of strings you might want to write with batchWrite? Or are you mainly dealing with buffers that have already been encoded by users (where you don't necessarily know how they encoded the string to the buffer)?

I apologize for the lack of clarity; the encoding is always UTF-8 for us. Our data is stored with a primary, non-dupsort dbi where the key is the unique id for the row & the value is the stringified JSON object; currently key & value are both strings. All subsequent dbis are dupsorted, where the key is either a string or a buffer (we are using buffers for numbers) & the value is a string. These dbis are all indices which allow us to search on a value & get back the primary key. The data size of the key and the data size of the value in these dbis are always <=250 bytes. What you are helping me with & what I'm curious about is whether storing raw strings is less optimal than, or no different from, storing buffers.
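
To illustrate, our layout corresponds roughly to dbis opened like this (names here are made up; dupSort is the openDbi option for duplicate keys):

```js
// env opened as in the earlier sketch; dbi names are made up for illustration
const primary = env.openDbi({ name: 'rows', create: true });                   // non-dupsort
const byName = env.openDbi({ name: 'idx_name', create: true, dupSort: true }); // index dbi

const txn = env.beginTxn();
txn.putString(primary, 'row-1', JSON.stringify({ name: 'Ada', age: 36 }));
txn.putString(byName, 'Ada', 'row-1'); // index entry pointing back at the primary key
txn.commit();
```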

This issue: #170 seems to imply that buffers do incur a performance penalty over strings. It looks like your lmdb-store library handles key sorting; I will take a look at that as it might help with a few updates I am working on regarding performance. Given the data model discussed above, do you feel lmdb-store would be a good fit? What is your opinion on mixing lmdb-store with raw node-lmdb in a project? Since the former is built on the latter I don't see an issue, but was curious if you could see one. I don't have a specific use case in mind, but want to see if there is any obvious case where I could shoot myself in the foot.

I do have questions regarding lmdb-store, is it more helpful to ask those on that github repo?

#170 seems to imply that buffers do incur a performance penalty over strings

Yes, constructing a new buffer on each get definitely has a performance penalty, and I was actually surprised to discover the significance of the cost of new buffers in #170 (I had previously thought that getBinaryUnsafe would be fastest, but that is clearly not the case). As a result (as I mentioned before), the performance optimization I have used to provide fast access to binary data (in lmdb-store) is to continually reuse a single buffer and copy data into it (then decode it). However, getString (and getUtf8, if I copy it over to node-lmdb) is pretty fast, since it skips buffer creation, and getUtf8() is definitely a lot faster (probably twice as fast) than getBinary().toString('utf8'). But for the entire deserialization process, deserializing data with msgpack from (reused) buffers, where only the data that will actually end up in string form is decoded, has proven to be the fastest in my experience.
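
A sketch of that last approach with the msgpackr package (which is what lmdb-store uses under the hood; the integration there is more involved than this):

```js
const { pack, unpack } = require('msgpackr');

// Store the value as a msgpack-encoded buffer...
const writeTxn = env.beginTxn();
writeTxn.putBinary(dbi, 'row-1', pack({ name: 'Ada', tags: ['a', 'b'] }));
writeTxn.commit();

// ...and decode straight from the buffer on read; only the fields that are
// actually strings get turned into JS strings.
const readTxn = env.beginTxn({ readOnly: true });
const value = unpack(readTxn.getBinaryUnsafe(dbi, 'row-1'));
readTxn.abort();
```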

Given the data model discussed above do you feel lmdb-store would be a good fit?

Yes, your database setup sounds similar to how we store data in our applications. The main difference is I've never used dupsort, but I can certainly test that out (in our application we have constructed indices with a tuple key of index key + source key that is always unique).
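
For example, instead of dupsort, an index entry can use a compound key like this (just a sketch; the separator byte is an assumption, and lmdb-store's ordered-binary keys handle this more robustly):

```js
// Compound key: indexedValue + separator + primary key, so every index entry
// is unique and a cursor range scan over the indexedValue prefix finds all rows.
// Buffer keys require the index dbi to be opened with keyIsBuffer: true.
function compoundKey(indexedValue, primaryKey) {
  return Buffer.concat([
    Buffer.from(indexedValue, 'utf8'),
    Buffer.from([0]), // separator (assumes the indexed value contains no NUL byte)
    Buffer.from(primaryKey, 'utf8'),
  ]);
}

function putIndexEntry(txn, indexDbi, indexedValue, primaryKey) {
  txn.putString(indexDbi, compoundKey(indexedValue, primaryKey), primaryKey);
}
```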

What is your opinion on mixing lmdb-store with raw node-lmdb in a project?

Yes, you can do this. lmdb-store is intended to be a kind of higher-level interface on top of node-lmdb; it has node-lmdb embedded in it, and you can use that "raw" API freely and interchangeably. lmdb-store embeds (instead of depending on) node-lmdb because there are a number of higher-level pieces of functionality that involve native C++ code for optimal efficiency, including buffer reuse (as described above), off-thread compression, and native key translation with UTF-8 and mixed types, which I wasn't sure really belonged in node-lmdb; node-lmdb is therefore forked and embedded in lmdb-store so that the C++ extensions are directly part of the compiled native code. Consequently, yes, the node-lmdb API is embedded and available in lmdb-store (with some minor changes; getBinaryUnsafe is used for buffer reuse instead of pointing at shared memory), and yes, you can freely interact with a given database through the lmdb-store API and the node-lmdb API. LMDB itself imposes constraints that would prevent you from separately opening the same Env twice, so you can't have both packages installed separately, but again, that wouldn't be necessary in order to keep using the embedded node-lmdb API.

Also, if you believe any of these additional pieces of functionality should be moved into node-lmdb itself, I would certainly be glad to do that; I have aimed to keep node-lmdb up to date as a high-quality low-level interface as well. I have only omitted functionality from node-lmdb when it seemed to be a little higher level or more opinionated than the spirit of node-lmdb.

I do have questions regarding lmdb-store, is it more helpful to ask those on that github repo?

Sure, that might make sense for more specific questions.

This is all extremely helpful. I have been reviewing the code in lmdb-store and a lot of what is implemented there would be very helpful for us, especially your implementation of the ordered binary key & msgpackr for values with the reused buffer. The real limiting factor may be dupsorting, but since I can also use the embedded enhanced node-lmdb, I may primarily use lmdb-store to access those features. I agree with your delineation of the projects, keeping node-lmdb a simplified 1-to-1 mapping of the C LMDB library and adding higher-order features elsewhere. I will move the rest of this conversation over to lmdb-store. Thank you again so much!