Tudyx / ai-dataloader

A Rust port of the PyTorch DataLoader

Multi-threaded dataloading with tch-rs

AzHicham opened this issue

Hello,

Thank you for your awesome work!!!
As you may know, there is no dataloader feature in tch-rs; it would be really cool to have one in ai-dataloader, maybe with multi-threading support as a further step.

Thank you :)

commented

Hello, thanks for your comment! I'm currently working on the single-threaded version; it should be available soon.

commented

Mono-threaded tch-rs integration is now available in version 0.4.0 🎉

Hello,

Thank you for your excellent work.

Any idea when we might have a dataloader with multi-threaded parallelism support?

Thank you

commented

I'm currently benchmarking the mono-threaded version against the PyTorch dataloader; multi-threading is definitely the next stop for me. As I'm doing this in my spare time, I can't promise when it will be released, though.

Hello :)
FYI, I did some benchmarks on a company project comparing the single-threaded version against PyTorch, and ai-dataloader is 2x faster.
Not sure why yet (the pipeline involves an HTTP call, PNG-to-image conversion, and tch-rs), but I'm pretty sure I can improve the speedup with multi-threading.
I'm currently trying to implement a multi-threaded version with rayon but it's a little bit complex ^^
If you have any idea about how to do that, let me know :)

commented

Hello ;) I've also observed a 2x speedup against PyTorch in my benchmarks, which are available here.

I'm also sure multi-threading will improve the speedup, as Rust doesn't have the limitation of the Python GIL.

I'm currently trying to implement a multi-threaded version with rayon but it's a little bit complex ^^

Totally agree, I haven't found much documentation on the subject, other than this tutorial, which seems pretty good.

commented

Another approach could be to take inspiration from the Burn parallel dataloader.

Nice, I'll look into that.
Another quick win along these lines: I simply used par_iter instead of iter with RAYON_NUM_THREADS=4, and the data loading took around 4x less time.
Obviously this is not optimal (no prefetching, etc.), but it's a really good first step.
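A minimal sketch of that quick win, assuming a hypothetical load_sample function standing in for the real per-sample work (it is not part of the ai-dataloader API):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the real per-sample work
// (HTTP call, PNG decoding, tensor conversion, ...).
fn load_sample(index: usize) -> Vec<f32> {
    vec![index as f32; 4]
}

fn main() {
    let indices: Vec<usize> = (0..1024).collect();

    // Before: sequential loading, one sample at a time.
    let _sequential: Vec<Vec<f32>> = indices.iter().map(|&i| load_sample(i)).collect();

    // After: par_iter spreads the same work over the rayon thread pool.
    // RAYON_NUM_THREADS=4 caps the pool at four threads.
    let _parallel: Vec<Vec<f32>> = indices.par_iter().map(|&i| load_sample(i)).collect();
}
```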

commented

Nice finding! A 4x speedup could justify adding this solution as an MVP.

commented

@AzHicham I've implemented your quick win in 253f018. I think finding the optimal parallelization will need more maturation and benchmarking, but that's a first step.

Hello,

FYI, my implementation is slightly different, and I'm not sure your implementation actually uses multiple threads. In fact, using install from rayon just sets the maximum number of threads in a pool that can be used for a parallel operation, but the for loop itself is not running in parallel.

Also, what I'm trying now is to add prefetching. This way, each time the dataloader's next() is called, the fetch may already be done.
To achieve that, I was thinking of using a fixed-size queue between ProcessDataLoaderIter and the BatchIterator (a sketch follows). WDYT?
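A minimal sketch of that idea, using std::sync::mpsc::sync_channel as the bounded queue; Batch and fetch_batch are hypothetical stand-ins, not the actual ProcessDataLoaderIter/BatchIterator types:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical stand-in for a collated batch.
type Batch = Vec<f32>;

// Pretend this does the expensive loading and collation work.
fn fetch_batch(index: usize) -> Batch {
    vec![index as f32; 8]
}

fn main() {
    // Capacity 2: at most two batches are prefetched ahead of the consumer.
    let (tx, rx) = sync_channel::<Batch>(2);

    // Producer thread fetches batches ahead of time; `send` blocks once the
    // queue is full, so memory use stays bounded.
    let producer = thread::spawn(move || {
        for batch_index in 0..10 {
            if tx.send(fetch_batch(batch_index)).is_err() {
                break; // the consumer dropped the receiver
            }
        }
    });

    // Consumer side: each iteration may find its batch already fetched.
    for batch in rx {
        let _ = batch.len();
    }

    producer.join().unwrap();
}
```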

IMO there are three ways to achieve multi-threading:
- N threads, each thread working on its own batch -> the Burn approach (sketched after this list)
- N threads working on the same batch before processing the next one
- Obviously, a combination of both ^^
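A rough sketch of the first approach using only std, where each worker owns whole batches; all names are hypothetical, and the batch re-ordering that Burn performs is deliberately left out:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

// Hypothetical stand-in for a collated batch.
type Batch = Vec<f32>;

fn fetch_batch(index: usize) -> Batch {
    vec![index as f32; 8]
}

fn main() {
    const NUM_WORKERS: usize = 4;
    const NUM_BATCHES: usize = 16;

    // Shared counter handing the next batch index to whichever worker is free.
    let next = Arc::new(AtomicUsize::new(0));
    let (tx, rx) = mpsc::channel::<Batch>();

    for _ in 0..NUM_WORKERS {
        let next = Arc::clone(&next);
        let tx = tx.clone();
        thread::spawn(move || loop {
            let index = next.fetch_add(1, Ordering::Relaxed);
            if index >= NUM_BATCHES {
                break;
            }
            // Note: batches arrive in completion order, not index order.
            if tx.send(fetch_batch(index)).is_err() {
                break;
            }
        });
    }
    drop(tx); // the channel closes once every worker has finished

    for batch in rx {
        let _ = batch.len();
    }
}
```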

commented

Hello @AzHicham, thanks for your comments.

FYI, my implementation is slightly different, and I'm not sure your implementation actually uses multiple threads. In fact, using install from rayon just sets the maximum number of threads in a pool that can be used for a parallel operation, but the for loop itself is not running in parallel.

Good catch! I think I will keep the install to be able to set up the number of threads, but I will use a rayon primitive inside it to make sure parallelism is actually used.
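For example, a minimal sketch of that fix, keeping install for the thread count but doing the actual work through a rayon parallel iterator (load_sample is a hypothetical stand-in):

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Hypothetical stand-in for the per-sample work.
fn load_sample(index: usize) -> Vec<f32> {
    vec![index as f32; 4]
}

fn main() {
    // Explicit thread count, instead of relying on the global default pool.
    let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    // `install` only selects which pool runs the closure; parallelism happens
    // because the closure itself uses a rayon primitive (`into_par_iter`).
    let batch: Vec<Vec<f32>> = pool
        .install(|| (0..64usize).into_par_iter().map(load_sample).collect());
    assert_eq!(batch.len(), 64);
}
```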

Also, what I'm trying now is to add prefetching. This way, each time the dataloader's next() is called, the fetch may already be done.
To achieve that, I was thinking of using a fixed-size queue between ProcessDataLoaderIter and the BatchIterator. WDYT?

I think it's a great idea! Prefetching is definitely something I wanted to add, and any work on this is welcome. Using a fixed-size queue seems fine; I will need to take a closer look at the PyTorch implementation to give better insight.

Hello @Tudyx

Nice :)

FYI, I have a first working version with prefetching here

It works well (tests pass) but the implementation really sucks. I'm still struggling with some traits & 'static lifetimes ^^

I'll keep you posted.

commented

The multi-threaded version should be fixed by b7035e2; thanks again for spotting the issue.