cruise-automation / rosbag.js

ROS bag file reader for JavaScript 👜

Feature Suggestion: Allow decompression functions to be asynchronous

gkjohnson opened this issue · comments

A few of the rosbags I'm working with are lz4-compressed, and I'm noticing that my frame time is often dominated by the decompression function when reading the file, leading to hiccups in application responsiveness.

It would be great to be able to run the decompression asynchronously on a web worker -- using ArrayBuffers and SharedArrayBuffers it should be possible to win pretty easily on decompression time.

Is it as easy as awaiting a promise when calling the decompression function here? The other question, then, is whether it would be safe to transfer ArrayBuffer ownership temporarily to a web worker while the decompression work happens. Or is it expected that multiple chunks share an ArrayBuffer? It looks like the answer might be no in the browser, but it's unclear when running in node. If they are separate, then it would be possible to decompress multiple chunks in parallel, too.
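For illustration, here's a minimal sketch of the pattern I have in mind, transferring the ArrayBuffer both ways so no copies are made (decompress-worker.js and the message shape are hypothetical):

// main thread: hand a compressed chunk to a worker, transferring ownership of its buffer
function decompressInWorker(worker, compressed) {
  return new Promise((resolve) => {
    worker.onmessage = (event) => resolve(new Uint8Array(event.data));
    worker.postMessage(compressed.buffer, [compressed.buffer]);
  });
}

// decompress-worker.js: decompress and transfer the result back
self.onmessage = (event) => {
  // assumes an lz4 decoder has been loaded into the worker, e.g. via importScripts
  const decompressed = lz4.decompress(new Uint8Array(event.data));
  self.postMessage(decompressed.buffer, [decompressed.buffer]);
};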

To advise more about your situation in particular I'd need to understand how your app is structured and what a "frame" is for you. But here's some info about the approach we take in Webviz:

The whole readMessages call happens in a worker. We pass noParse: true: https://github.com/cruise-automation/webviz/blob/9528e0a25c99a4c471563a8dc6aef0083db663f2/packages/webviz-core/src/players/BagDataProvider.js#L122

This results in raw buffers which can be transferred back to the main thread, which is implemented using our @cruise-automation/rpc package:
https://github.com/cruise-automation/webviz/blob/9528e0a25c99a4c471563a8dc6aef0083db663f2/packages/webviz-core/src/players/RpcDataProvider.js#L54-L56

We parse the messages back on the main thread in a separate step: https://github.com/cruise-automation/webviz/blob/9528e0a25c99a4c471563a8dc6aef0083db663f2/packages/webviz-core/src/players/ParseMessagesDataProvider.js#L47

The pieces fit together as described here:
https://github.com/cruise-automation/webviz/blob/9528e0a25c99a4c471563a8dc6aef0083db663f2/packages/webviz-core/src/players/standardDataProviderDescriptors.js#L16-L26

All of this works in our application because of the "read ahead" layer on top, which will read extra messages asynchronously to fill up a queue that the main thread can pull from at regular intervals.
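Stripped of the data-provider layering, the flow is roughly this (a sketch rather than our exact code; I'm glossing over exactly how the MessageReader gets constructed from each connection's message definition):

// in the worker: collect raw, unparsed records
const bag = await open(file);
const results = [];
await bag.readMessages({ noParse: true, startTime, endTime }, (result) => {
  results.push(result);
});
// each result.data is an unparsed record; its underlying buffer can go in the transfer list
self.postMessage(results, results.map((result) => result.data.buffer));

// on the main thread: parse each record in a separate step
const reader = new MessageReader(connection.messageDefinition);
const message = reader.readMessage(result.data);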

I'd need to understand how your app is structured and what a "frame" is for you.

We're using rosbag.js on the main thread and buffering messages ahead of time, as well. When we increment the time, we check whether new messages should be buffered and perform the read. A "frame" here is basically just a browser display frame, and when we have to buffer messages during one it can take noticeably longer because of the decompression.

The whole readMessages call happens in a worker. We pass noParse: true

This results in raw buffers which can be transferred back to the main thread

We parse the messages back on the main thread in a separate step

I see, so you're reading out chunk buffers in another worker with read and noParse: true, and then parsing them on the UI thread with read again -- I didn't realize noParse could be used like that. Is there anything else being gained by running rosbag.js itself in a worker? I'm less familiar with the inner workings of the library, but from profiling it looks like the only other expensive work happening besides decompression is message parsing, which it sounds like you're doing on the main thread anyway.

At some point in the past, we did the message parsing in the worker as well. I believe @brianc measured and found that the cost of the postMessage / structured clone of a parsed message led to slower performance than transferring ArrayBuffers (which is basically free) + parsing on the main thread.

To answer part of your original question, to my knowledge the chunks will all have separate buffers.

structured clone of a parsed message led to slower performance

This is definitely my experience, as well -- structured clone is extremely slow, especially on large objects.

So it sounds like really only the decompression is happening on a worker at the moment. To me it seemed more intuitive to run the decompression in a worker by enabling the decompression functions to be async, instead of running one rosbag instance in a worker and creating another on the UI thread. I think there could be performance benefits here, too, in that you could decompress multiple ArrayBuffers simultaneously, right? E.g. three chunks that each take 20ms to decompress would take 60ms when run back-to-back in one worker but 20ms if you can run them all at the same time. Even with a little worker overhead that could still be a win.

Thanks again for your explanations!

So it sounds like really only the decompression is happening on a worker at the moment.

There are other costs which are paid on the worker thread, such as actually reading from disk (or the network — although these are always async in the browser anyway) and the parsing of connections, chunk metadata, and individual record buffers out of the chunk.

I think there could be performance benefits here, too, in that you could decompress multiple ArrayBuffers simultaneously, right?

Yes, it still sounds like you're onto something; it could be a nice improvement to the BagReader API if decompression were allowed to be async. You might need to be careful with the _lastReadResult bookkeeping in the BagReader (but since the _file.read() is already async, it might not actually be a big deal). If you're interested in exploring this I'm curious to hear how it works out for you, and would happily review a PR :)

I can definitely look into a PR if it looks promising -- I just wanted to make sure this was an agreeable change, first.

I hope you don't mind if I bounce a few thoughts around and ask a few more clarifying questions! I'm still wrapping my head around the structure of the code a bit.

1.

There are other costs which are paid on the worker thread, such as actually reading from disk ... parsing of connections, chunk metadata,

Just to be clear, the parsing of connections and chunk metadata is a one-time cost up front, right? Not that it isn't worth saving the time, but I assumed these were comparatively cheap, too. I haven't actually profiled it, though, so maybe that's not the case. I imagine the reading of buffers is relatively cheap as well.

2.

I see that all file reads, decompression, and parsing are awaited in BagReader, meaning that it all happens sequentially, so there's no way for the file reads or HTTP requests to happen in parallel. Is there a reason that all calls to readChunkMessagesAsync aren't kicked off immediately and then awaited using Promise.all? In order to see overall time benefits from async decompression these calls would have to be run in parallel (or at least some of them would be; see the sketch after these questions).

3.

You might need to be careful with the _lastReadResult bookkeeping in the BagReader

I see that this is basically an optimization that caches the last result so it can be returned quickly. Do you know if / how often this cache is used in practice in your application? Is it important that this be set to the result of the last chunk processed in the sequence, for example? If the aforementioned change is made to the way readChunkMessagesAsync is called, then it's possible that the last chunk to finish would be the first one in the sequence being parsed.
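To make question 2 concrete, here's what I mean, reusing the names from bag.js (just a sketch of the idea, ignoring memory concerns for the moment):

// kick off every chunk read immediately...
const promises = chunkInfos.map((info) =>
  this.reader.readChunkMessagesAsync(info, filteredConnections, startTime, endTime, decompress)
);
// ...then emit results in chunk order once they all settle
const messagesPerChunk = await Promise.all(promises);
messagesPerChunk.forEach((messages, i) => messages.forEach((msg) => callback(parseMsg(msg, i))));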

Just to be clear, the parsing of connections and chunk metadata is a one-time cost up front, right?

I'm not sure what you mean by one-time — if you are intending to read the whole bag, then in some ways everything is a one-time cost. It's true that there are probably an order of magnitude fewer connections / chunk-metadata records than there are actual messages, so it might not be a big deal. Your measurements are certainly more accurate than my speculation 😄 (But as I also mentioned, the individual message records still need to be parsed in order to find the indexes of the message buffers; this happens in readChunkMessages.)

Is there a reason that all calls to readChunkMessagesAsync aren't kicked off immediately and then awaited using Promise.all?

We regularly work with bag files >1GB in size and don't want to hold all that in memory at once.

I see that this is basically an optimization that caches the last result so it can be returned quickly. Do you know if / how often this cache is used in practice in your application?

I wasn't very familiar with this part of the code, but I think you're right; if you want to drop the ordering assumption then you could easily replace it with some slightly fancier memoization (if the optimization is even necessary at all). I'm not sure how often it is used in practice for us, but you could try with the code at https://github.com/cruise-automation/webviz and add some instrumentation if you're interested. (I would think it was originally added because this was a hot path, but @brianc would be able to confirm.)
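For example, something like this could stand in for the single _lastReadResult slot (hypothetical and untested, keyed by chunk position):

class ChunkResultCache {
  constructor(maxEntries = 4) {
    this._maxEntries = maxEntries;
    this._cache = new Map(); // insertion-ordered, so the first key is the oldest
  }
  get(chunkPosition) {
    return this._cache.get(chunkPosition);
  }
  set(chunkPosition, result) {
    if (this._cache.size >= this._maxEntries) {
      this._cache.delete(this._cache.keys().next().value); // evict the oldest entry
    }
    this._cache.set(chunkPosition, result);
  }
}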

I'm not sure what you mean by one-time — if you are intending to read the whole bag, then in some ways everything is a one-time cost.

Fair, I suppose. I just mean that it happens once up front when initializing the bag, while message parsing happens when playing back the data, which makes it particularly performance-sensitive. I understand what you're saying, though.

We regularly work with bag files >1GB in size and don't want to hold all that in memory at once.

I see, I'll keep that in mind. The way we're using the bag, we read messages in 1-5 second batches as needed and cache them until they're no longer needed, so there's never a time when all the bag data is in memory at once for us.

When I can I'll try a few things out and propose some more specific changes that work for us. Thanks!

Ah, yeah, I had forgotten that at this point in the code, the chunkInfos are already filtered to a particular requested time range:

rosbag.js/src/bag.js

Lines 111 to 113 in a9f6ff7

for (let i = 0; i < chunkInfos.length; i++) {
  const info = chunkInfos[i];
  const messages = await this.reader.readChunkMessagesAsync(

If the number of chunks is small, it definitely would make sense to allow decompressing (or even parsing) them in parallel. It's just that, theoretically, you could write a program to process the whole bag with a single bagReader.readMessages((message) => ...), and in that case you probably wouldn't want it to start decompressing all the chunks when it's going to return the results message-by-message anyway.

Back to this question:

I see that this is basically an optimization that caches the last result so it can be returned quickly. Do you know if / how often this cache is used in practice in your application?

Now that I'm recalling a little more how this all fits together — I'm almost certain it's a hot path because we load data in by making many readMessages() calls for small startTime/endTime ranges during playback (preloading by about 100ms, configured here). Since these time ranges are almost always adjacent, it's very common that a read happens in the same chunk as the previous read.

(Note that it's technically possible for multiple chunks in a bag to cover the same time range, i.e. the messages can be "out of order" in the file; rosbag.js doesn't currently handle this gracefully/correctly. The necessary logic is implemented here in rosbag_storage by, somewhat surprisingly, sorting a set of connection iterators after each message read.)

By the way, how are you doing LZ4 decompression? In Webviz we hook up wasm-lz4 as the decompression function, which is a big perf win over a pure JS implementation.

Since these time ranges are almost always adjacent, it's very common that a read happens in the same chunk as the previous read.

That's kind of what I thought it might be, as well -- I'll try to keep the last sequential chunk cached.

Note that it's technically possible for multiple chunks in a bag to cover the same time range, i.e. the messages can be "out of order" in the file; rosbag.js doesn't currently handle this gracefully/correctly.

Oh interesting, thanks for the notice -- is this documented anywhere? What types of problems should I expect to see as a result of this?

In Webviz we hook up wasm-lz4 as the decompression function, which is a big perf win over a pure JS implementation.

I'm just using the lz4js package as referenced in the repo -- I'll give your wasm version a try, too.
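For anyone following along, the hookup looks roughly like this -- a sketch from memory of the README pattern, so treat the exact decompress signature as an assumption:

const { open } = require("rosbag");
const lz4 = require("lz4js");

async function readLz4Bag(path) {
  const bag = await open(path);
  await bag.readMessages(
    { decompress: { lz4: (buffer) => Buffer.from(lz4.decompress(buffer)) } },
    (result) => console.log(result.topic, result.message)
  );
}

// the wasm-lz4 variant needs an async init and the uncompressed size, something like:
//   await wasmLz4.isLoaded;
//   ... decompress: { lz4: (buffer, size) => wasmLz4(buffer, size) } ...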

Thanks!

Hi! Sorry I'm a bit late to the party here - I'm one of the original authors of rosbag.js... lemme see if there's anything that's gone unanswered yet & add my 2 cents on things.

I see that this is basically an optimization that caches the last result so it can be returned quickly. Do you know if / how often this cache is used in practice in your application?

As Jacob said, it's heavily used in our application - typically reading the same 'chunk' over and over in little reads.

wasm-lz4 is quite a bit faster than lz4js. I'd recommend using it in performance-critical applications, though it might not be entirely intuitive to get working. Feel free to open an issue over there if you find the docs are lacking & I'll try to touch them up.

I'm curious as to how you're thinking of doing the work in parallel, exactly? Loading multiple web workers? Each one handling a decompression call in a round-robin fashion? Decompression is entirely CPU-bound & you'll likely see some perf gains by decompressing a few chunks at once, provided each decompression task is delegated to its own separate web worker, but likely diminishing returns after that... I'm not sure exactly how the browser decides which work to run on which OS thread for a worker, but I saw performance taper off quite a bit after about 4 workers on my machine when doing some other work in workers. In Webviz we ended up having adequate performance by using wasm-lz4 and pushing most of the bag reading/decompressing work into a single web worker... after that, bag reading wasn't the slowest thing in the app, so perf priorities shifted.

Reading more chunks in parallel makes sense so long as there's a reasonably low upper bound on "pre-reading" so the entire bag isn't read into memory. I think the work involved in this lib is likely supporting an optional async decompression call (while maintaining the sync version as well for backwards compatibility) and then having the BagReader keep some concept of "I've gone ahead and decompressed the next n chunks in the bag, so I'll use those decompressed chunks if I have them when fetching ranges from them". Then kick out chunks LRU-style while you fetch new ones if there's a cache miss?

Then the actual work of parallelizing decompression across multiple web workers can be done in a separate library... I'd imagine that library having the API of an async decompression function & doing the fan-out to workers. I've used promise-queue for work like this w/ good results.

Hope this helps! Thanks for all your thoughts on the subject. 😄

I'm curious as to how you're thinking of doing the work in parallel, exactly? Loading multiple web workers? Each one handling a decompression call in a round-robin fashion?

Yup! Creating a pool of workers that pick tasks off a queue has worked pretty well for us in the past for ArrayBuffer processing. MDN notes that navigator.hardwareConcurrency reports the number of logical cores, so if you try to run more workers than that you'll likely see diminishing returns (my laptop reports 4). My impression is that, depending on how quickly the worker tasks complete, it might still be worth creating more so that another worker can pick up a task immediately once one finishes.
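Roughly the shape I have in mind (a sketch; decompress-worker.js is a hypothetical worker script that decompresses and posts the buffer back, as above):

const poolSize = navigator.hardwareConcurrency || 4;
const idleWorkers = Array.from({ length: poolSize }, () => new Worker("decompress-worker.js"));
const taskQueue = [];

function decompressInPool(compressed) {
  return new Promise((resolve) => {
    taskQueue.push({ compressed, resolve });
    pump();
  });
}

function pump() {
  while (idleWorkers.length > 0 && taskQueue.length > 0) {
    const worker = idleWorkers.pop();
    const { compressed, resolve } = taskQueue.shift();
    worker.onmessage = (event) => {
      resolve(new Uint8Array(event.data));
      idleWorkers.push(worker);
      pump(); // a worker freed up; pick up the next queued task
    };
    worker.postMessage(compressed.buffer, [compressed.buffer]);
  }
}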

Reading more chunks in parallel makes sense so long as there's a reasonably low upper-bound of "pre-reading" so the entire bag isn't read into memory.

My plan is to add an option specifying the maximum number of bytes allowed to be loaded at once; processing will wait for a chunk to finish before carrying forward, with the assumption that at least one chunk should be processing at all times. Max bytes can default to zero to replicate the current one-chunk-at-a-time behavior (see the sketch below). There's a corner case that I'll ask for your opinion on after I do a little more testing and can submit a draft.
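In rough code, the plan looks like this (a sketch; maxBytes and the per-chunk size lookup are the hypothetical parts):

async function readChunksWithBudget(chunkInfos, readChunk, getChunkSize, maxBytes) {
  const inFlight = new Set();
  let bytesInFlight = 0;
  for (const info of chunkInfos) {
    // always let at least one chunk through, even if it alone exceeds the budget
    while (inFlight.size > 0 && bytesInFlight + getChunkSize(info) > maxBytes) {
      await Promise.race(inFlight);
    }
    const promise = readChunk(info).finally(() => {
      inFlight.delete(promise);
      bytesInFlight -= getChunkSize(info);
    });
    inFlight.add(promise);
    bytesInFlight += getChunkSize(info);
  }
  await Promise.all(inFlight);
}

With maxBytes = 0 this degenerates to the current one-chunk-at-a-time behavior.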

the actual work of parallelizing decompression across multiple web workers can be done in a separate library

I agree!

Quick update

It looks like there are cases where a larger underlying ArrayBuffer is shared among many chunks, meaning it can't efficiently be transferred to a worker. To avoid going down too many rabbit holes I'm going to table this for now and look into wasm-lz4.

@brianc @jtbandes

Before I forget I wanted to follow up on this again because it seems like the kind of thing that could bite us if we're not sure what to look for:

Note that it's technically possible for multiple chunks in a bag to cover the same time range, i.e. the messages can be "out of order" in the file; rosbag.js doesn't currently handle this gracefully/correctly.

Oh interesting, thanks for the notice -- is this documented anywhere? What types of problems should I expect to see as a result of this?

Thanks!

The thing to look for would be bag files where the chunks overlap in time. This would likely only happen if you created the bag file "manually" with a script that adds messages to the bag such that their timestamps are out of order (and even then, the chunk size would have to be small enough that earlier messages written later actually end up in a new chunk). A rosbag record run would write them with the timestamps they're received at, which should be monotonic, so it wouldn't experience this problem.

The issue amounts to the fact that during readMessages we just iterate over matching chunks in order:

rosbag.js/src/bag.js

Lines 111 to 121 in bc264c7

for (let i = 0; i < chunkInfos.length; i++) {
  const info = chunkInfos[i];
  const messages = await this.reader.readChunkMessagesAsync(
    info,
    filteredConnections,
    startTime,
    endTime,
    decompress
  );
  messages.forEach((msg) => callback(parseMsg(msg, i)));
}
As I pointed out above, the C++ bag reader handles this by keeping a separate iterator for each chunk and sorting them after each message is emitted. https://github.com/ros/ros_comm/blob/29053c4832229efa7160fb944c05e3bc82e11540/tools/rosbag_storage/src/view.cpp#L151

Should we detect this using the bag index and show a warning or error?
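If we did, it could be a single pass over the chunk index, something like this (a sketch; I'm assuming the time-comparison helpers rosbag.js exports as TimeUtil):

function hasOverlappingChunks(chunkInfos) {
  const sorted = [...chunkInfos].sort((a, b) => TimeUtil.compare(a.startTime, b.startTime));
  for (let i = 1; i < sorted.length; i++) {
    // a chunk starting before the previous one ends means their messages can interleave
    if (TimeUtil.isLessThan(sorted[i].startTime, sorted[i - 1].endTime)) {
      return true;
    }
  }
  return false;
}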